Web Scraping for Lead Enrichment

Use fastCRW to scrape company pages, directories, and public profiles for firmographic and contact data, then push structured fields into your CRM — fresher than vendor databases, cheaper per record, and automatable for AI SDR workflows.

Published: April 4, 2026
Updated: June 24, 2026
Category: use cases
Verdict

fastCRW makes lead enrichment from public web sources practical at any scale — from a nightly batch on 500 CRM records to a real-time AI SDR pipeline processing thousands of inbound leads per day. The single-binary architecture means you can self-host the entire enrichment loop inside your own VPC with zero third-party data egress.

  • Scrape company pages for fresh firmographic data direct from the source
  • Extract structured contact and team data with JSON schema extraction
  • Batch-enrich via the crawl endpoint across entire company domains
  • Run inside your own network: no data leaves your VPC unless you choose the managed service

Why Lead Enrichment Needs Web Scraping

CRM records decay faster than most sales teams realize. People change roles every 18–24 months on average. Companies rebrand, pivot products, and update pricing. The contact page you scraped last quarter may already show a different head of sales.

Third-party enrichment databases (Apollo, ZoomInfo, Clearbit) help, but they solve a different problem: they aggregate data across many sources and resell it on a per-record basis. That model introduces two friction points that matter at scale:

  1. Freshness lag. A provider's database reflects when they last crawled a company's site — often weeks or months ago. If your ICP (ideal customer profile) is in fast-moving sectors like AI or fintech, stale data costs pipeline.
  2. Cost per record. At volume — enriching 50,000 inbound leads per month — per-record fees compound quickly. Scraping public company sites directly costs fractions of a cent per domain in server fees.

Direct scraping gives your pipeline:

  • Current firmographic data from the company's own About page — headcount, location, product lines, founding year
  • Fresh team structure from leadership and team pages — who's the new VP of Engineering, when did they hire a Head of Partnerships
  • Product and pricing signals from pricing pages — did they add an enterprise tier, drop a plan, change the headline pitch
  • Technology signals from page source and meta tags — what stack they're building on, which integrations they advertise

The tradeoff is clear: you own the pipeline, so you also own the work of keeping its data fresh. For verified phone and email, a specialist provider is still better. For everything publicly visible on a company's website, scraping wins on cost and recency.

Where fastCRW Fits in the Enrichment Stack

Enrichment need | fastCRW endpoint | Notes
--- | --- | ---
Discover relevant pages on a domain | /v1/map | Returns all URLs: about, team, pricing, careers, contact
Pull structured firmographics | /v1/scrape + jsonSchema | 5 credits per extract; 1 credit for raw markdown
Crawl an entire company site | /v1/crawl | Respects maxDepth and maxPages caps; 1 credit per page
Search for a company by name when you lack the domain | /v1/search | Returns top results with URLs; 1 credit per query
Render JS-heavy SPAs and dynamic team pages | Auto renderer | All renderers (http, lightpanda, Chrome) cost 1 credit

Typical Enrichment Pipeline

A production enrichment pipeline for a B2B sales team has five stages:

1. CRM export. Pull the domains of unenriched or stale records from your CRM. A Salesforce SOQL query like SELECT Website FROM Account WHERE LastEnrichmentDate < LAST_N_DAYS:30 gives you the input list. LastEnrichmentDate stands in for your own custom field; Salesforce custom field API names end in __c.

2. URL discovery. Call /v1/map on each domain to get all page URLs. Filter for pages matching patterns like /about, /team, /leadership, /company, /pricing, /contact. Most company sites have predictable URL structures; map once per domain per month.

3. Structured extraction. For each relevant page URL, call /v1/scrape with formats: ["json"] and a jsonSchema defining the CRM fields you want. The extraction LLM fills the schema from the page content. No HTML parsing and no custom selectors are needed: one schema definition per page type replaces per-site parsing code.

4. Merge and deduplicate. A company's About page and Team page often overlap (both mention the CEO, both show the HQ location). Merge extracted records from multiple pages per domain, preferring the more specific value when fields conflict.

5. CRM write-back. Push enriched fields to the CRM via its API with a freshness timestamp. Patch only changed fields to avoid triggering downstream automation on unchanged records.
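
Stages 4 and 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not fastCRW functionality: merge_page_records and changed_fields are hypothetical helper names, and "more specific" is approximated as "the longer string or list", a heuristic you would tune per field.

```python
from typing import Any


def specificity(value: Any) -> int:
    """Rough proxy for specificity (assumption: longer means more specific)."""
    if value is None:
        return 0
    if isinstance(value, (str, list, dict)):
        return len(value)
    return 1  # numbers and booleans: present, but not comparable by length


def merge_page_records(records: list[dict]) -> dict:
    """Merge per-page extractions for one domain, keeping the more specific value."""
    merged: dict = {}
    for record in records:
        for field, value in record.items():
            if specificity(value) > specificity(merged.get(field)):
                merged[field] = value
    return merged


def changed_fields(crm_record: dict, enriched: dict) -> dict:
    """Return only the fields whose value differs from what the CRM already holds."""
    return {k: v for k, v in enriched.items() if crm_record.get(k) != v}


# Example: the team page carries a more specific HQ location than the about page.
about = {"company_name": "Acme", "hq_location": "Berlin", "industry": ""}
team = {"company_name": "Acme", "hq_location": "Berlin, Germany"}
merged = merge_page_records([about, team])
patch = changed_fields({"company_name": "Acme", "hq_location": "Berlin"}, merged)
print(patch)
```

Only the patch dictionary would be sent to the CRM, so records whose extracted values match the stored ones produce an empty patch and can be skipped entirely.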

Implementation: Lead Enrichment Pipeline

curl — scrape a company about page with schema extraction

curl -X POST https://api.fastcrw.com/v1/scrape \
  -H "Authorization: Bearer $CRW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-company.com/about",
    "formats": ["json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "company_name":    { "type": "string" },
        "industry":        { "type": "string" },
        "employee_count":  { "type": "string" },
        "hq_location":     { "type": "string" },
        "founded_year":    { "type": "string" },
        "description":     { "type": "string", "description": "1-2 sentence company description" },
        "key_products":    { "type": "array", "items": { "type": "string" } }
      },
      "required": ["company_name", "description"]
    }
  }'

Python — full enrichment loop across a domain list

import os
import requests
from datetime import datetime
from typing import Optional

# Read the key from the environment rather than hard-coding it in source.
CRW_API_KEY = os.environ["CRW_API_KEY"]
CRW_BASE_URL = "https://api.fastcrw.com/v1"

FIRMOGRAPHIC_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name":   { "type": "string" },
        "industry":       { "type": "string" },
        "employee_count": { "type": "string", "description": "Headcount or range, e.g. '50-200'" },
        "hq_location":    { "type": "string" },
        "founded_year":   { "type": "string" },
        "description":    { "type": "string", "description": "1-2 sentence company description" },
        "key_products":   { "type": "array", "items": { "type": "string" } },
        "tech_stack":     { "type": "array", "items": { "type": "string" } },
    },
    "required": ["company_name"]
}

TEAM_SCHEMA = {
    "type": "object",
    "properties": {
        "executives": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name":       { "type": "string" },
                    "title":      { "type": "string" },
                    "linkedin":   { "type": "string" }
                }
            }
        }
    }
}

def map_domain(domain: str) -> list[str]:
    """Discover all pages on a company domain."""
    resp = requests.post(
        f"{CRW_BASE_URL}/map",
        json={"url": f"https://{domain}"},
        headers={"Authorization": f"Bearer {CRW_API_KEY}"},
        timeout=30
    )
    resp.raise_for_status()
    return resp.json().get("urls", [])

def scrape_with_schema(url: str, schema: dict) -> Optional[dict]:
    """Scrape a URL and extract structured fields via JSON schema."""
    resp = requests.post(
        f"{CRW_BASE_URL}/scrape",
        json={
            "url": url,
            "formats": ["json"],
            "jsonSchema": schema
        },
        headers={"Authorization": f"Bearer {CRW_API_KEY}"},
        timeout=30
    )
    if resp.status_code == 200:
        return resp.json().get("data", {}).get("json")
    return None

def filter_relevant_pages(urls: list[str]) -> dict[str, list[str]]:
    """Bucket discovered URLs by page type."""
    buckets: dict[str, list[str]] = {"about": [], "team": [], "pricing": []}
    patterns = {
        "about":   ["/about", "/company", "/our-story", "/who-we-are"],
        "team":    ["/team", "/leadership", "/people", "/founders"],
        "pricing": ["/pricing", "/plans", "/packages"],
    }
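    # Imported here to keep this fix self-contained; move it to the top of the module in real code.
    from urllib.parse import urlparse

    for url in urls:
        # Match on the path only, so a domain like company.com does not hit "/company".
        path = urlparse(url).path.lower()
        for bucket, keywords in patterns.items():
            if any(kw in path for kw in keywords):
                buckets[bucket].append(url)
    return buckets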

def enrich_domain(domain: str) -> dict:
    """Run the full enrichment pipeline for one company domain."""
    result: dict = {"domain": domain, "enriched_at": datetime.utcnow().isoformat()}

    # Step 1: Discover pages
    all_urls = map_domain(domain)
    buckets = filter_relevant_pages(all_urls)

    # Step 2: Extract firmographics from about pages
    for url in buckets["about"][:2]:  # cap at 2 about pages
        data = scrape_with_schema(url, FIRMOGRAPHIC_SCHEMA)
        if data:
            result.update({k: v for k, v in data.items() if v and k not in result})

    # Step 3: Extract team data from team pages
    for url in buckets["team"][:1]:
        data = scrape_with_schema(url, TEAM_SCHEMA)
        if data and "executives" in data:
            result["executives"] = data["executives"]

    return result

def enrich_domain_list(domains: list[str]) -> list[dict]:
    """Enrich a list of company domains (serial for demo; parallelize in prod)."""
    enriched = []
    for i, domain in enumerate(domains):
        print(f"[{i+1}/{len(domains)}] Enriching {domain}...")
        try:
            record = enrich_domain(domain)
            enriched.append(record)
        except Exception as e:
            print(f"  Error enriching {domain}: {e}")
            enriched.append({"domain": domain, "error": str(e)})
    return enriched

if __name__ == "__main__":
    domains = [
        "stripe.com",
        "notion.so",
        "linear.app",
        "vercel.com",
        "supabase.com",
    ]

    results = enrich_domain_list(domains)

    print("\n=== ENRICHMENT RESULTS ===")
    for r in results:
        print(f"\n{r.get('domain')}:")
        print(f"  Company:   {r.get('company_name', 'N/A')}")
        print(f"  Industry:  {r.get('industry', 'N/A')}")
        print(f"  Headcount: {r.get('employee_count', 'N/A')}")
        print(f"  Location:  {r.get('hq_location', 'N/A')}")
        execs = r.get("executives", [])
        if execs:
            print(f"  Executives: {len(execs)} found")

JavaScript/TypeScript — enrichment worker for a queue-based pipeline

const CRW_API_KEY = process.env.CRW_API_KEY!;
const CRW_BASE_URL = "https://api.fastcrw.com/v1";

const firmographicSchema = {
  type: "object",
  properties: {
    company_name:   { type: "string" },
    industry:       { type: "string" },
    employee_count: { type: "string" },
    hq_location:    { type: "string" },
    description:    { type: "string" },
    key_products:   { type: "array", items: { type: "string" } },
  },
  required: ["company_name"],
};

async function mapDomain(domain: string): Promise<string[]> {
  const res = await fetch(`${CRW_BASE_URL}/map`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${CRW_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: `https://${domain}` }),
  });
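  // Surface HTTP failures instead of silently treating them as "no pages found".
  if (!res.ok) throw new Error(`map failed for ${domain}: HTTP ${res.status}`);
  const data = await res.json();
  return data.urls ?? [];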
}

async function scrapeWithSchema(
  url: string,
  schema: object
): Promise<Record<string, unknown> | null> {
  const res = await fetch(`${CRW_BASE_URL}/scrape`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${CRW_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, formats: ["json"], jsonSchema: schema }),
  });
  if (!res.ok) return null;
  const data = await res.json();
  return data?.data?.json ?? null;
}

async function enrichDomain(domain: string) {
  const urls = await mapDomain(domain);
  const aboutUrl = urls.find((u) =>
    ["/about", "/company", "/our-story"].some((kw) => u.toLowerCase().includes(kw))
  );

  if (!aboutUrl) return { domain, error: "no about page found" };

  const firmographics = await scrapeWithSchema(aboutUrl, firmographicSchema);
  return {
    domain,
    enriched_at: new Date().toISOString(),
    ...firmographics,
  };
}

// Parallel enrichment with concurrency cap
async function enrichBatch(domains: string[], concurrency = 5) {
  const results: unknown[] = [];
  for (let i = 0; i < domains.length; i += concurrency) {
    const batch = domains.slice(i, i + concurrency);
    const batchResults = await Promise.allSettled(batch.map(enrichDomain));
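    // Keep the domain on failures and stringify the reason: a raw Error serializes to {}.
    results.push(
      ...batchResults.map((r, j) =>
        r.status === "fulfilled" ? r.value : { domain: batch[j], error: String(r.reason) }
      )
    );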
  }
  return results;
}

// Example
const domains = ["stripe.com", "notion.so", "linear.app"];
enrichBatch(domains, 5).then((results) => console.log(JSON.stringify(results, null, 2)));

AI SDR Workflows: Enrichment as a Real-Time Signal

AI sales development representatives (AI SDRs) have made lead enrichment a real-time requirement rather than a nightly batch job. When a prospect submits a demo request, the AI SDR needs firmographic context within seconds to personalize the first email.

fastCRW fits this pattern well because:

  • Low latency for single-domain lookups. A single /v1/scrape call completes in 1,914 ms at p50 (benchmarked against Firecrawl's public 1,000-URL dataset with diagnose_3way.py on 2026-05-08; CANONICAL-FACTS §5). A map call plus two page scrapes takes ~5 seconds end-to-end, which fits inside the window before a welcome email sends.
  • Self-hostable for zero data egress. For regulated industries, the enrichment data (company descriptions, executive names) never needs to leave your VPC. Spin up fastCRW on an internal server and call it from your AI SDR service directly.
  • Firecrawl-compatible API. If your AI SDR already integrates Firecrawl, swapping to fastCRW is a base-URL change and an API key swap, with no other code changes needed.

A typical AI SDR enrichment flow on inbound:

Inbound form submit
  → fastCRW /v1/map(domain) → filter about/team URLs
  → fastCRW /v1/scrape(about_url, jsonSchema) → firmographics JSON
  → AI SDR prompt: "Personalize this email for {company_name}, a {employee_count}-person {industry} company based in {hq_location} that builds {description}."
  → Send personalized email
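
The prompt step in the flow above can be sketched as follows. Because the schemas in this article only require company_name, the other fields are often missing, so the sketch drops any clause whose field is empty. build_sdr_prompt is a hypothetical helper name, and the fallback wording is an assumption.

```python
def build_sdr_prompt(firmographics: dict) -> str:
    """Build the personalization prompt, skipping clauses whose fields are missing."""
    name = firmographics.get("company_name")
    if not name:
        # Without a company name there is nothing to personalize on.
        return "Write a friendly, generic welcome email."

    descriptors = []
    if firmographics.get("employee_count"):
        descriptors.append(f"{firmographics['employee_count']}-person")
    if firmographics.get("industry"):
        descriptors.append(firmographics["industry"])

    prompt = f"Personalize this email for {name}"
    if descriptors:
        prompt += f", a {' '.join(descriptors)} company"
    if firmographics.get("hq_location"):
        prompt += f" based in {firmographics['hq_location']}"
    if firmographics.get("description"):
        # Strip a trailing period so the sentence does not end in "..".
        prompt += f". Company description: {firmographics['description'].rstrip('.')}"
    return prompt + "."


print(build_sdr_prompt({"company_name": "Acme", "industry": "fintech", "hq_location": "Berlin"}))
```

Falling back to a generic email when extraction returns nothing keeps the welcome flow from blocking on a failed scrape.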

Production Considerations

Parallelism and rate limits

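Serial enrichment is fine for nightly batches of a few hundred domains. For larger volumes, parallelize /v1/scrape with a concurrency cap that stays within your plan's rate limits. At 10 concurrent workers and ~2 s per scrape, the ceiling is about 5 scrapes per second, or roughly 432,000 scrapes per day. At two extracted pages per domain that is over 200,000 domains per day before map calls are counted, far more than a large enterprise SDR team's monthly inbound, so your plan's rate limit and credit budget are likely to bind before worker count does.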
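
A minimal sketch of a concurrency cap using Python's standard library, plus a back-of-the-envelope ceiling on call volume. enrich_concurrently, calls_per_day, and fake_enrich are hypothetical names; in practice you would pass in your own per-domain enrichment function, and the real limit is whatever your plan allows.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

SECONDS_PER_DAY = 86_400


def calls_per_day(workers: int, seconds_per_call: float) -> int:
    """Upper bound on API calls per day if every worker stays busy."""
    return int(workers * SECONDS_PER_DAY / seconds_per_call)


def enrich_concurrently(
    domains: list[str],
    enrich: Callable[[str], dict],
    max_workers: int = 10,
) -> list[dict]:
    """Run `enrich` over domains with at most `max_workers` calls in flight."""

    def safe(domain: str) -> dict:
        try:
            return enrich(domain)
        except Exception as exc:  # keep one bad domain from failing the batch
            return {"domain": domain, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of completion order.
        return list(pool.map(safe, domains))


def fake_enrich(domain: str) -> dict:
    """Stand-in for a real enrichment call, used only for this demo."""
    if domain == "broken.example":
        raise RuntimeError("timeout")
    return {"domain": domain, "company_name": domain.split(".")[0].title()}


results = enrich_concurrently(["acme.example", "broken.example"], fake_enrich, max_workers=2)
print(results)
print(calls_per_day(workers=10, seconds_per_call=2.0))
```

Threads suit this workload because each worker spends nearly all of its time waiting on network I/O.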

Handling failed scrapes

Not every company website has a clean About page. Implement retry logic with exponential backoff for 5xx responses. If a domain returns consistent 403s or has heavy bot protection, fall back to a web search: call /v1/search with the company name to find directory listings or press coverage that surface the same firmographic fields.
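
The retry policy described above might look like the sketch below. It is transport-agnostic: request is any zero-argument callable returning a (status_code, body) pair, which is an assumption made so the example runs without a network. call_with_backoff is a hypothetical helper name.

```python
import time
from typing import Callable, Optional


def call_with_backoff(
    request: Callable[[], tuple[int, Optional[dict]]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Retry 5xx responses with delays of base_delay * 2**attempt seconds."""
    for attempt in range(max_attempts):
        status, body = request()
        if status == 200:
            return body
        if not 500 <= status < 600:
            # 403, 404, and similar: retrying will not help, so let the caller fall back.
            return None
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None


# Demo: two server errors followed by a success, with sleeps recorded instead of executed.
responses = iter([(503, None), (502, None), (200, {"company_name": "Acme"})])
delays: list[float] = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

A None result is the signal to switch to the /v1/search fallback. Adding random jitter to each delay is a common refinement when many workers retry at once.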

Caching map results

/v1/map results for a given domain are stable for weeks. Cache the URL list in Redis or your database with a 14-day TTL. Only re-map when the enrichment timestamp crosses your freshness threshold. This cuts map credit usage significantly for monthly re-enrichment cycles.
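
A minimal sketch of the TTL cache, using an in-memory dictionary as a stand-in for Redis or a database table. MapCache and fake_map are hypothetical names; fetch is whatever function performs your real map call.

```python
import time
from typing import Callable

FOURTEEN_DAYS = 14 * 24 * 60 * 60  # TTL in seconds


class MapCache:
    """Cache of domain -> (stored_at, urls) with a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str], list[str]],
        ttl_seconds: float = FOURTEEN_DAYS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, list[str]]] = {}

    def get_urls(self, domain: str) -> list[str]:
        """Return cached URLs while fresh; otherwise re-map and store the result."""
        now = self._clock()
        entry = self._entries.get(domain)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        urls = self._fetch(domain)
        self._entries[domain] = (now, urls)
        return urls


# Demo with a fake map call and a controllable clock.
calls: list[str] = []


def fake_map(domain: str) -> list[str]:
    calls.append(domain)
    return [f"https://{domain}/about"]


now = [0.0]
cache = MapCache(fake_map, clock=lambda: now[0])
cache.get_urls("example.com")  # miss: performs the map call
cache.get_urls("example.com")  # hit: served from cache
now[0] += FOURTEEN_DAYS + 1
cache.get_urls("example.com")  # expired: maps again
print(len(calls))
```

With Redis the same behavior usually comes from setting a key expiry at write time, so no explicit timestamp comparison is needed.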

Schema versioning

As your CRM schema evolves, version your extraction schemas. Store schema_version alongside enriched records so you know which fields were extracted under which schema and can backfill when you add new fields.
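
One way to sketch this, assuming a simple registry that maps each schema version to the fields it extracts. SCHEMA_VERSIONS, stamp, and fields_to_backfill are hypothetical names, and the field lists are illustrative.

```python
# Registry of which fields each schema version extracts (illustrative values).
SCHEMA_VERSIONS: dict[int, set[str]] = {
    1: {"company_name", "industry", "hq_location"},
    2: {"company_name", "industry", "hq_location", "tech_stack"},
}
CURRENT_VERSION = 2


def stamp(record: dict, version: int = CURRENT_VERSION) -> dict:
    """Attach the schema version used for extraction to an enriched record."""
    return {**record, "schema_version": version}


def fields_to_backfill(record: dict) -> list[str]:
    """List fields in the current schema that the record's schema version lacked."""
    extracted_under = SCHEMA_VERSIONS.get(record.get("schema_version"), set())
    return sorted(SCHEMA_VERSIONS[CURRENT_VERSION] - extracted_under)


old_record = stamp({"company_name": "Acme", "industry": "fintech"}, version=1)
new_record = stamp({"company_name": "Acme", "tech_stack": ["Python"]})
print(fields_to_backfill(old_record))
print(fields_to_backfill(new_record))
```

A record with no schema_version at all is treated as needing every current field, which is the conservative choice for legacy rows.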

Self-hosting for data control

If your enrichment pipeline handles leads from regulated industries (healthcare, fintech, legal), self-host fastCRW inside your own infrastructure. The single ~8 MB binary image (CANONICAL-FACTS §7) runs on a $5–10/month VPS. Target company websites are public, but the enriched records — your CRM data — never need to transit a third-party API.

Credit Cost Estimates

All credit costs from CANONICAL-FACTS §3 (marketing/CANONICAL-FACTS.md, verified 2026-05-29):

Operation | Credits | Notes
--- | --- | ---
/v1/map per domain | 1 | Discover all URLs on a company site
/v1/scrape (markdown only) | 1 | Raw page content, any renderer (http, lightpanda, Chrome)
/v1/scrape with formats: ["json"] | 5 | Structured extraction via LLM

Example: enrich 500 CRM records/month

  • 500 map calls = 500 credits
  • 500 × 2 page scrapes per domain (about + team) with extraction = 500 × 2 × 5 = 5,000 credits
  • Total: 5,500 credits/month, which fits the Standard plan ($69/mo, 100,000 credits). If you extract only 1 page per domain, the total drops to 500 + 2,500 = 3,000 credits, which exactly fills the Hobby plan ($13/mo launch price, 3,000 credits) with no headroom.

For one full re-enrichment pass per month over a 5,000-account CRM with 2 pages each (a nightly pass would multiply this by about 30):

  • 5,000 map calls + 10,000 extract scrapes = 5,000 + 50,000 = 55,000 credits/month
  • Fits the Standard plan ($69/mo — launch price, was $99, 100,000 credits)
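
The arithmetic in both examples can be reproduced with a small estimator. It uses the per-operation credit costs from the table above; estimate_credits and its parameters are hypothetical names for illustration.

```python
MAP_CREDITS = 1      # /v1/map per domain
EXTRACT_CREDITS = 5  # /v1/scrape with formats: ["json"], per page


def estimate_credits(
    domains: int,
    pages_per_domain: int,
    passes_per_month: int = 1,
    remap: bool = True,
) -> int:
    """Estimate monthly credits for map + structured extraction over a domain list."""
    map_cost = domains * MAP_CREDITS if remap else 0
    extract_cost = domains * pages_per_domain * EXTRACT_CREDITS
    return passes_per_month * (map_cost + extract_cost)


print(estimate_credits(500, 2))                          # 500 records, about + team
print(estimate_credits(500, 1))                          # 500 records, one page each
print(estimate_credits(5_000, 2))                        # 5,000 accounts, one pass
print(estimate_credits(5_000, 2, passes_per_month=30))   # same CRM, nightly passes
```

Setting remap=False models a run that reuses cached map results, which removes the map term from the total.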

Pricing derives from PLAN_DISPLAY in src/lib/plans-client.ts. Launch pricing was in effect through 2026-06-01; check /pricing for current rates.

Self-hosting is free — you pay only your server. A $10/month VPS handles hundreds of concurrent enrichment requests.

Good Fits for Lead Enrichment

  • B2B sales teams enriching inbound demo requests before the first SDR touchpoint
  • AI SDR workflows that personalize outreach in real time using firmographic context
  • Marketing teams building firmographic audience segments for ABM campaigns
  • Recruiting teams mapping org structures at target companies before outreach
  • Competitive intelligence teams tracking headcount changes, new hires, and role shifts at key accounts
  • RevOps teams maintaining CRM hygiene by detecting stale records and triggering re-enrichment
  • Platform teams building an internal enrichment microservice that other tools consume

When to Pick Something Else

fastCRW is the right tool when you need publicly visible data from company websites. There are cases where other approaches win:

  • Verified contact data (email, phone): Use a dedicated provider (Apollo, Hunter, Clearbit) that specializes in verified contact data. Public company pages rarely list individual emails, and scraping email addresses raises compliance concerns under GDPR and CAN-SPAM.
  • Social graph data (LinkedIn connections, follower counts): LinkedIn Terms of Service prohibit scraping. Use their official partner APIs or a compliant data provider.
  • Behind-login content: If the data lives behind an authenticated portal (a client dashboard, a private directory), fastCRW cannot reach it without credentials, and doing so may violate the site's ToS.
  • Firmographic at extreme volume with no ops budget: At millions of records per month, a dedicated enrichment API with bulk pricing may be more cost-effective than operating your own scraping infrastructure.
