Web Scraping for Lead Enrichment
Use fastCRW to scrape company pages, directories, and public profiles for firmographic and contact data, then push structured fields into your CRM — fresher than vendor databases, cheaper per record, and automatable for AI SDR workflows.
fastCRW makes lead enrichment from public web sources practical at any scale — from a nightly batch on 500 CRM records to a real-time AI SDR pipeline processing thousands of inbound leads per day. The single-binary architecture means you can self-host the entire enrichment loop inside your own VPC with zero third-party data egress.
Why Lead Enrichment Needs Web Scraping
CRM records decay faster than most sales teams realize. People change roles every 18–24 months on average. Companies rebrand, pivot products, and update pricing. The contact page you scraped last quarter may already show a different head of sales.
Third-party enrichment databases (Apollo, ZoomInfo, Clearbit) help, but they solve a different problem: they aggregate data across many sources and resell it on a per-record basis. That model introduces two friction points that matter at scale:
- Freshness lag. A provider's database reflects when they last crawled a company's site — often weeks or months ago. If your ICP (ideal customer profile) is in fast-moving sectors like AI or fintech, stale data costs pipeline.
- Cost per record. At volume — enriching 50,000 inbound leads per month — per-record fees compound quickly. Scraping public company sites directly costs fractions of a cent per domain in server fees.
Direct scraping gives your pipeline:
- Current firmographic data from the company's own About page — headcount, location, product lines, founding year
- Fresh team structure from leadership and team pages — who's the new VP of Engineering, when did they hire a Head of Partnerships
- Product and pricing signals from pricing pages — did they add an enterprise tier, drop a plan, change the headline pitch
- Technology signals from page source and meta tags — what stack they're building on, which integrations they advertise
The tradeoff is clear: you own the pipeline, but you also own the freshness. For verified phone and email, a specialist provider is still better. For everything publicly visible on a company's website, scraping wins on cost and recency.
Where fastCRW Fits in the Enrichment Stack
| Enrichment need | fastCRW endpoint | Notes |
|---|---|---|
| Discover relevant pages on a domain | /v1/map | Returns all URLs — about, team, pricing, careers, contact |
| Pull structured firmographics | /v1/scrape + jsonSchema | 5 credits per extract; 1 credit for raw markdown |
| Crawl an entire company site | /v1/crawl | Respects maxDepth and maxPages caps; 1 credit per page |
| Search for a company by name when you lack the domain | /v1/search | Returns top results with URLs; 1 credit per query |
| Render JS-heavy SPAs and dynamic team pages | Auto renderer | All renderers (http, lightpanda, Chrome) cost 1 credit |
Typical Enrichment Pipeline
A production enrichment pipeline for a B2B sales team has five stages:
1. CRM export
Pull domains of unenriched or stale records from your CRM. A Salesforce SOQL query like SELECT Website FROM Account WHERE LastEnrichmentDate < LAST_N_DAYS:30 gives you the input list.
2. URL discovery
Call /v1/map on each domain to get all page URLs. Filter for pages matching patterns like /about, /team, /leadership, /company, /pricing, /contact. Most company sites have predictable URL structures; map once per domain per month.
3. Structured extraction
For each relevant page URL, call /v1/scrape with formats: ["json"] and a jsonSchema defining the CRM fields you want. The extraction LLM fills the schema from the page content. No HTML parsing, no custom selectors — one schema definition covers the entire company.
4. Merge and deduplicate A company's About page and Team page often overlap (both mention the CEO, both show the HQ location). Merge extracted records from multiple pages per domain, preferring the more specific value when fields conflict.
5. CRM write-back Push enriched fields to CRM via API with a freshness timestamp. Patch only changed fields to avoid triggering downstream automation on unchanged records.
Implementation: Lead Enrichment Pipeline
curl — scrape a company about page with schema extraction
curl -X POST https://api.fastcrw.com/v1/scrape \
-H "Authorization: Bearer $CRW_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example-company.com/about",
"formats": ["json"],
"jsonSchema": {
"type": "object",
"properties": {
"company_name": { "type": "string" },
"industry": { "type": "string" },
"employee_count": { "type": "string" },
"hq_location": { "type": "string" },
"founded_year": { "type": "string" },
"description": { "type": "string", "description": "1-2 sentence company description" },
"key_products": { "type": "array", "items": { "type": "string" } }
},
"required": ["company_name", "description"]
}
}'
Python — full enrichment loop across a domain list
import requests
import json
from datetime import datetime
from typing import Optional
CRW_API_KEY = "your-api-key"
CRW_BASE_URL = "https://api.fastcrw.com/v1"
FIRMOGRAPHIC_SCHEMA = {
"type": "object",
"properties": {
"company_name": { "type": "string" },
"industry": { "type": "string" },
"employee_count": { "type": "string", "description": "Headcount or range, e.g. '50-200'" },
"hq_location": { "type": "string" },
"founded_year": { "type": "string" },
"description": { "type": "string", "description": "1-2 sentence company description" },
"key_products": { "type": "array", "items": { "type": "string" } },
"tech_stack": { "type": "array", "items": { "type": "string" } },
},
"required": ["company_name"]
}
TEAM_SCHEMA = {
"type": "object",
"properties": {
"executives": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"title": { "type": "string" },
"linkedin": { "type": "string" }
}
}
}
}
}
def map_domain(domain: str) -> list[str]:
"""Discover all pages on a company domain."""
resp = requests.post(
f"{CRW_BASE_URL}/map",
json={"url": f"https://{domain}"},
headers={"Authorization": f"Bearer {CRW_API_KEY}"},
timeout=30
)
resp.raise_for_status()
return resp.json().get("urls", [])
def scrape_with_schema(url: str, schema: dict) -> Optional[dict]:
"""Scrape a URL and extract structured fields via JSON schema."""
resp = requests.post(
f"{CRW_BASE_URL}/scrape",
json={
"url": url,
"formats": ["json"],
"jsonSchema": schema
},
headers={"Authorization": f"Bearer {CRW_API_KEY}"},
timeout=30
)
if resp.status_code == 200:
return resp.json().get("data", {}).get("json")
return None
def filter_relevant_pages(urls: list[str]) -> dict[str, list[str]]:
"""Bucket discovered URLs by page type."""
buckets: dict[str, list[str]] = {"about": [], "team": [], "pricing": []}
patterns = {
"about": ["/about", "/company", "/our-story", "/who-we-are"],
"team": ["/team", "/leadership", "/people", "/founders"],
"pricing": ["/pricing", "/plans", "/packages"],
}
for url in urls:
path = url.lower()
for bucket, keywords in patterns.items():
if any(kw in path for kw in keywords):
buckets[bucket].append(url)
return buckets
def enrich_domain(domain: str) -> dict:
"""Run the full enrichment pipeline for one company domain."""
result: dict = {"domain": domain, "enriched_at": datetime.utcnow().isoformat()}
# Step 1: Discover pages
all_urls = map_domain(domain)
buckets = filter_relevant_pages(all_urls)
# Step 2: Extract firmographics from about pages
for url in buckets["about"][:2]: # cap at 2 about pages
data = scrape_with_schema(url, FIRMOGRAPHIC_SCHEMA)
if data:
result.update({k: v for k, v in data.items() if v and k not in result})
# Step 3: Extract team data from team pages
for url in buckets["team"][:1]:
data = scrape_with_schema(url, TEAM_SCHEMA)
if data and "executives" in data:
result["executives"] = data["executives"]
return result
def enrich_domain_list(domains: list[str]) -> list[dict]:
"""Enrich a list of company domains (serial for demo; parallelize in prod)."""
enriched = []
for i, domain in enumerate(domains):
print(f"[{i+1}/{len(domains)}] Enriching {domain}...")
try:
record = enrich_domain(domain)
enriched.append(record)
except Exception as e:
print(f" Error enriching {domain}: {e}")
enriched.append({"domain": domain, "error": str(e)})
return enriched
if __name__ == "__main__":
domains = [
"stripe.com",
"notion.so",
"linear.app",
"vercel.com",
"supabase.com",
]
results = enrich_domain_list(domains)
print("\n=== ENRICHMENT RESULTS ===")
for r in results:
print(f"\n{r.get('domain')}:")
print(f" Company: {r.get('company_name', 'N/A')}")
print(f" Industry: {r.get('industry', 'N/A')}")
print(f" Headcount: {r.get('employee_count', 'N/A')}")
print(f" Location: {r.get('hq_location', 'N/A')}")
execs = r.get("executives", [])
if execs:
print(f" Executives: {len(execs)} found")
JavaScript/TypeScript — enrichment worker for a queue-based pipeline
const CRW_API_KEY = process.env.CRW_API_KEY!;
const CRW_BASE_URL = "https://api.fastcrw.com/v1";
const firmographicSchema = {
type: "object",
properties: {
company_name: { type: "string" },
industry: { type: "string" },
employee_count: { type: "string" },
hq_location: { type: "string" },
description: { type: "string" },
key_products: { type: "array", items: { type: "string" } },
},
required: ["company_name"],
};
async function mapDomain(domain: string): Promise<string[]> {
const res = await fetch(`${CRW_BASE_URL}/map`, {
method: "POST",
headers: {
Authorization: `Bearer ${CRW_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({ url: `https://${domain}` }),
});
const data = await res.json();
return data.urls ?? [];
}
async function scrapeWithSchema(
url: string,
schema: object
): Promise<Record<string, unknown> | null> {
const res = await fetch(`${CRW_BASE_URL}/scrape`, {
method: "POST",
headers: {
Authorization: `Bearer ${CRW_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({ url, formats: ["json"], jsonSchema: schema }),
});
if (!res.ok) return null;
const data = await res.json();
return data?.data?.json ?? null;
}
async function enrichDomain(domain: string) {
const urls = await mapDomain(domain);
const aboutUrl = urls.find((u) =>
["/about", "/company", "/our-story"].some((kw) => u.toLowerCase().includes(kw))
);
if (!aboutUrl) return { domain, error: "no about page found" };
const firmographics = await scrapeWithSchema(aboutUrl, firmographicSchema);
return {
domain,
enriched_at: new Date().toISOString(),
...firmographics,
};
}
// Parallel enrichment with concurrency cap
async function enrichBatch(domains: string[], concurrency = 5) {
const results: unknown[] = [];
for (let i = 0; i < domains.length; i += concurrency) {
const batch = domains.slice(i, i + concurrency);
const batchResults = await Promise.allSettled(batch.map(enrichDomain));
results.push(...batchResults.map((r) => (r.status === "fulfilled" ? r.value : { error: r.reason })));
}
return results;
}
// Example
const domains = ["stripe.com", "notion.so", "linear.app"];
enrichBatch(domains, 5).then((results) => console.log(JSON.stringify(results, null, 2)));
AI SDR Workflows: Enrichment as a Real-Time Signal
AI sales development representatives (AI SDRs) have made lead enrichment a real-time requirement rather than a nightly batch job. When a prospect submits a demo request, the AI SDR needs firmographic context within seconds to personalize the first email.
fastCRW fits this pattern well because:
- Low latency for single-domain lookups. A single
/v1/scrapecall completes at p50 in 1914 ms (benchmark against Firecrawl's public 1,000-URL dataset,diagnose_3way.py, 2026-05-08 — CANONICAL-FACTS §5). Map + scrape two pages takes ~5 seconds end-to-end — well within the window before a welcome email sends. - Self-hostable for zero data egress. For regulated industries, the enrichment data (company descriptions, executive names) never needs to leave your VPC. Spin up fastCRW on an internal server and call it from your AI SDR service directly.
- Firecrawl-compatible API. If your AI SDR already integrates Firecrawl, swapping to fastCRW is a base-URL change and an API key swap — no code changes needed.
A typical AI SDR enrichment flow on inbound:
Inbound form submit
→ fastCRW /v1/map(domain) → filter about/team URLs
→ fastCRW /v1/scrape(about_url, jsonSchema) → firmographics JSON
→ AI SDR prompt: "Personalize this email for {company_name}, a {employee_count}-person {industry} company based in {hq_location} that builds {description}."
→ Send personalized email
Production Considerations
Parallelism and rate limits
Serial enrichment is fine for nightly batches of a few hundred domains. For larger volumes, parallelize /v1/scrape with a concurrency cap that stays within your plan's rate limits. At 10 concurrent workers and ~2 s per scrape, you can process ~8,600 domains per day — enough for a large enterprise SDR team's monthly inbound.
Handling failed scrapes
Not every company website has a clean About page. Implement retry logic with exponential backoff for 5xx responses. If a domain returns consistent 403s or has heavy bot protection, fall back to a web search: call /v1/search with the company name to find directory listings or press coverage that surface the same firmographic fields.
Caching map results
/v1/map results for a given domain are stable for weeks. Cache the URL list in Redis or your database with a 14-day TTL. Only re-map when the enrichment timestamp crosses your freshness threshold. This cuts map credit usage significantly for monthly re-enrichment cycles.
Schema versioning
As your CRM schema evolves, version your extraction schemas. Store schema_version alongside enriched records so you know which fields were extracted under which schema and can backfill when you add new fields.
Self-hosting for data control
If your enrichment pipeline handles leads from regulated industries (healthcare, fintech, legal), self-host fastCRW inside your own infrastructure. The single ~8 MB binary image (CANONICAL-FACTS §7) runs on a $5–10/month VPS. Target company websites are public, but the enriched records — your CRM data — never need to transit a third-party API.
Credit Cost Estimates
All credit costs from CANONICAL-FACTS §3 (marketing/CANONICAL-FACTS.md, verified 2026-05-29):
| Operation | Credits | Notes |
|---|---|---|
/v1/map per domain | 1 | Discover all URLs on a company site |
/v1/scrape (markdown only) | 1 | Raw page content, any renderer (http, lightpanda, Chrome) |
/v1/scrape with formats: ["json"] | 5 | Structured extraction via LLM |
Example: enrich 500 CRM records/month
- 500 map calls = 500 credits
- 500 × 2 page scrapes per domain (about + team) with extraction = 500 × 2 × 5 = 5,000 credits
- Total: ~5,500 credits/month → fits the Hobby plan ($13/mo launch price, 3,000 credits) if you scrape 1 page per domain, or Standard plan ($69/mo, 100,000 credits) for 2-page extraction
For nightly re-enrichment of a 5,000-account CRM with 2 pages each:
- 5,000 map calls + 10,000 extract scrapes = 5,000 + 50,000 = 55,000 credits/month
- Fits the Standard plan ($69/mo — launch price, was $99, 100,000 credits)
Pricing derives from PLAN_DISPLAY in src/lib/plans-client.ts. Launch pricing was in effect through 2026-06-01; check /pricing for current rates.
Self-hosting is free — you pay only your server. A $10/month VPS handles hundreds of concurrent enrichment requests.
Good Fits for Lead Enrichment
- B2B sales teams enriching inbound demo requests before the first SDR touchpoint
- AI SDR workflows that personalize outreach in real time using firmographic context
- Marketing teams building firmographic audience segments for ABM campaigns
- Recruiting teams mapping org structures at target companies before outreach
- Competitive intelligence teams tracking headcount changes, new hires, and role shifts at key accounts
- RevOps teams maintaining CRM hygiene by detecting stale records and triggering re-enrichment
- Platform teams building an internal enrichment microservice that other tools consume
When to Pick Something Else
fastCRW is the right tool when you need publicly visible data from company websites. There are cases where other approaches win:
- Verified contact data (email, phone): Use a dedicated provider (Apollo, Hunter, Clearbit) that maintains opt-in databases. Public company pages rarely list individual emails, and scraping email addresses raises compliance concerns under GDPR and CAN-SPAM.
- Social graph data (LinkedIn connections, follower counts): LinkedIn Terms of Service prohibit scraping. Use their official partner APIs or a compliant data provider.
- Behind-login content: If the data lives behind an authenticated portal (a client dashboard, a private directory), fastCRW cannot reach it without credentials, and doing so may violate the site's ToS.
- Firmographic at extreme volume with no ops budget: At millions of records per month, a dedicated enrichment API with bulk pricing may be more cost-effective than operating your own scraping infrastructure.
Related Resources
- Firecrawl alternative — how fastCRW compares on accuracy and cost for enrichment workloads
- Apify alternative — when a full actor platform vs. a simple scraping API is the right call
- Competitor monitoring — track product and pricing changes at key accounts over time
- MCP integration — use fastCRW tools directly from Claude or any MCP-compatible AI agent
- LangChain integration — chain enrichment scrapes with LLM summarization in a LangChain pipeline
- Pricing — current plan credits and rates
Continue exploring
More from Use Cases
Web Scraping for Price Monitoring
Web Scraping for Market Research
Web Scraping for Real Estate Data
Use fastCRW to build property listing pipelines from public real estate sites with structured extraction of price, location, beds/baths, and features.
Web Scraping for Content Aggregation
Build a comprehensive content aggregation pipeline with fastCRW: discover URLs across any source, scrape full-text pages into clean markdown, deduplicate, extract structured metadata, and feed a data pipeline dashboard — all via a single Firecrawl-compatible API.
Web Scraping for RAG and AI Agent Training Data
Collect, clean, and normalize web corpora for RAG knowledge bases and AI agent training datasets with fastCRW — high-fidelity markdown, 63.74% truth-recall, Firecrawl-compatible API, single Rust binary.
Related hubs
