Web Scraping for Real Estate Data
Use fastCRW to build property listing pipelines from real estate sites with structured extraction of price, location, and features.
fastCRW powers property data pipelines by extracting listings, prices, and location features from public real estate sites, turning accessible HTML into structured JSON with geocoding integration. The legal boundary is not uniform: major portals like Zillow forbid scraping outright, so check each site's ToS and robots.txt before crawling. You get fresh market data daily, deduplication by address, and a pipeline that beats manual scraping or expensive data aggregators.
Verdict
fastCRW builds real estate data pipelines from public property listing sites. Zillow's ToS forbids scraping, but listing sites with more permissive terms and public real estate aggregators expose rich property data that fastCRW extracts into structured JSON: price, address, beds/baths, square footage, HOA fees, days-on-market. You get a unified property feed across multiple sources, price tracking over time, and a pipeline that beats manual collection or paying thousands for enterprise data feeds. The tradeoff: respect ToS strictly, understand the legal risks of your chosen data source, and verify compliance with a lawyer before commercializing.
Why Real Estate Needs Web Scraping
Property data is fragmented across dozens of sites: Zillow, Redfin, Apartments.com, Craigslist, and local MLS systems. Scraping public listings lets you:
- aggregate listings from multiple sources into a single search,
- track price changes and days-on-market over time,
- analyze neighborhood market trends and affordability,
- power listing alerts and investment screening,
- and build a PropTech product without expensive MLS licensing.
Most property sites embed structured data in their pages (microdata, JSON-LD, or consistent plain HTML) that fastCRW extracts into clean JSON ready for your database.
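Many listing pages carry their structured data as a schema.org JSON-LD block, which you can inspect before committing to a full crawl. A minimal sketch, assuming the page embeds such a block (the URL is a placeholder):

```python
import json
import re

import requests

# Hypothetical listing URL; replace with a page you are permitted to fetch
html = requests.get("https://example.com/listing/123", timeout=30).text

# Find any <script type="application/ld+json"> blocks in the page
for block in re.findall(
    r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL
):
    try:
        data = json.loads(block)
        print(data.get("@type"), data.get("name"))
    except json.JSONDecodeError:
        continue  # Skip blocks that are not valid JSON
```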
Where fastCRW Helps
| Need | fastCRW Role |
|---|---|
| Multi-source aggregation | crawl multiple real estate sites and merge into unified feed |
| Price extraction | LLM extraction parses "$450,000" and "$2,500/mo" into numeric fields |
| Feature parsing | Extract beds, baths, sqft, lot size, year built, etc. from descriptions |
| Pagination | crawl with maxDepth: 2 loads all listings across search results |
| Price history | Run daily crawls and track price changes by property address |
Architecture Overview
A production real estate pipeline has five stages:
- Discovery: Map the site to find listing URLs and search filters.
- Crawl: Fetch all listings across pages, respecting rate limits.
- Extraction: Parse structured fields (price, address, beds, baths, etc.).
- Normalization: Clean addresses, deduplicate by location, geocode.
- Load: Insert into your property database with timestamp tracking.
fastCRW handles discovery and crawl; your code handles extraction schema, deduplication, and geocoding.
Implementation Walkthrough
Here's a complete Python example that scrapes a real estate listing site, extracts properties, deduplicates by address, and geocodes for map visualization.
Step 1: Install dependencies
```bash
uv venv
uv pip install requests geopy python-dotenv
```
Step 2: Define the extraction schema and pipeline
```python
import json
import os
import re
import time
from datetime import datetime, timezone
from difflib import SequenceMatcher

import requests
from dotenv import load_dotenv

# Load API keys from a .env file if present
load_dotenv()
FASTCRW_API_KEY = os.getenv("FASTCRW_API_KEY")
FASTCRW_BASE_URL = "https://api.fastcrw.com/v1"
# Define the extraction schema for property listings
PROPERTY_EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "address": {
            "type": "string",
            "description": "Full street address (street, city, state, ZIP)"
        },
        "price": {
            "type": "number",
            "description": "Price in USD (for sale) or monthly rent (for rental)"
        },
        "bedrooms": {
            "type": "integer",
            "description": "Number of bedrooms"
        },
        "bathrooms": {
            "type": "number",
            "description": "Number of bathrooms (may be decimal like 2.5)"
        },
        "square_feet": {
            "type": "integer",
            "description": "Square footage of the property"
        },
        "property_type": {
            "type": "string",
            "enum": ["House", "Condo", "Townhouse", "Apartment", "Land", "Other"],
            "description": "Type of property"
        },
        "lot_size": {
            "type": "string",
            "description": "Lot size (e.g., '0.25 acres' or '10,000 sqft')"
        },
        "year_built": {
            "type": "integer",
            "description": "Year the property was built"
        },
        "hoa_fee": {
            "type": "number",
            "description": "Monthly HOA fee if applicable"
        },
        "days_on_market": {
            "type": "integer",
            "description": "Days the property has been listed"
        },
        "listing_url": {
            "type": "string",
            "description": "URL of the listing detail page"
        }
    },
    "required": ["address", "price", "bedrooms", "bathrooms"]
}
def crawl_real_estate_site(
    search_url: str,
    max_pages: int = 5,
) -> list[dict]:
    """
    Crawl a real estate site to discover and extract listing pages.

    Args:
        search_url: Search-results URL from a listing site whose ToS permits crawling
        max_pages: Maximum number of discovered listing pages to crawl

    Returns:
        List of properties with extracted data
    """
    print(f"Mapping real estate site: {search_url}")

    # Step 1: Map the site to find listing structure
    map_payload = {
        "url": search_url,
        "maxDepth": 2,  # Follow pagination
    }
    map_response = requests.post(
        f"{FASTCRW_BASE_URL}/map",
        headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
        json=map_payload,
    )
    map_response.raise_for_status()
    discovered_urls = map_response.json().get("urls", [])
    print(f"Found {len(discovered_urls)} listing pages")

    # Step 2: Crawl listings and extract data
    all_properties = []
    for url in discovered_urls[:max_pages]:
        try:
            print(f"Crawling {url}")
            crawl_payload = {
                "url": url,
                "formats": ["markdown"],
                "extraction": {
                    "schema": PROPERTY_EXTRACTION_SCHEMA
                },
            }
            response = requests.post(
                f"{FASTCRW_BASE_URL}/crawl",
                headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
                json=crawl_payload,
                timeout=60,
            )
            response.raise_for_status()
            crawl_result = response.json()

            # Extract properties from the crawl result
            if "data" in crawl_result:
                for item in crawl_result["data"]:
                    if "extractedData" in item:
                        prop = item["extractedData"]
                        prop["source_url"] = item.get("url")
                        prop["scraped_at"] = datetime.now(timezone.utc).isoformat()
                        all_properties.append(prop)

            time.sleep(2)  # Respect rate limits
        except Exception as e:
            print(f"Error crawling {url}: {e}")

    return all_properties
def normalize_address(address: str) -> str:
    """
    Normalize an address for deduplication.

    Args:
        address: Raw address string

    Returns:
        Normalized address
    """
    if not address:
        return ""

    # Convert to uppercase and strip whitespace
    normalized = address.strip().upper()

    # Expand common street-type abbreviations (with or without a trailing
    # period), matching whole words only. Note: this is deliberately
    # simplistic; two-letter state codes like CT can collide with street types.
    abbreviations = {
        "ST": "STREET",
        "AVE": "AVENUE",
        "BLVD": "BOULEVARD",
        "RD": "ROAD",
        "CT": "COURT",
        "LN": "LANE",
        "DR": "DRIVE",
        "PL": "PLACE",
    }
    for abbr, full in abbreviations.items():
        normalized = re.sub(rf"\b{abbr}\.?(?=\s|$)", full, normalized)

    return normalized
def deduplicate_properties(properties: list[dict]) -> list[dict]:
    """
    Deduplicate property listings by address with fuzzy matching.

    Args:
        properties: List of property records

    Returns:
        Deduplicated list (keeps first occurrence)
    """
    seen = {}
    unique_properties = []

    for prop in properties:
        norm_address = normalize_address(prop.get("address", ""))

        # Check for an exact match
        if norm_address in seen:
            continue

        # Check for a fuzzy match (90%+ similarity)
        found_match = False
        for seen_address in seen.keys():
            similarity = SequenceMatcher(None, norm_address, seen_address).ratio()
            if similarity >= 0.90:
                found_match = True
                break

        if not found_match:
            seen[norm_address] = True
            unique_properties.append(prop)

    return unique_properties
def geocode_properties(properties: list[dict]) -> list[dict]:
    """
    Add latitude/longitude to properties via geocoding.

    Args:
        properties: List of property records

    Returns:
        Properties with latitude and longitude added
    """
    try:
        from geopy.geocoders import Nominatim

        geocoder = Nominatim(user_agent="realestate_scraper")
        for prop in properties:
            address = prop.get("address")
            if not address:
                continue
            try:
                location = geocoder.geocode(address, timeout=5)
                if location:
                    prop["latitude"] = location.latitude
                    prop["longitude"] = location.longitude
                else:
                    print(f"Geocoding failed for: {address}")
                    prop["latitude"] = None
                    prop["longitude"] = None
                time.sleep(0.5)  # Rate limit the geocoding API
            except Exception as e:
                print(f"Geocoding error for {address}: {e}")
                prop["latitude"] = None
                prop["longitude"] = None
        return properties
    except ImportError:
        print("geopy not installed. Skipping geocoding.")
        return properties
def detect_price_changes(
    current_properties: list[dict],
    previous_properties: list[dict],
) -> list[dict]:
    """
    Detect price changes between two scrape runs.

    Args:
        current_properties: Latest properties
        previous_properties: Previous scrape results

    Returns:
        Properties with price_change and price_change_pct fields
    """
    # Build a lookup of previous prices by normalized address
    prev_prices = {}
    for prop in previous_properties:
        addr = normalize_address(prop.get("address", ""))
        if addr:
            prev_prices[addr] = prop.get("price")

    # Calculate deltas
    for prop in current_properties:
        addr = normalize_address(prop.get("address", ""))
        current_price = prop.get("price")
        if addr in prev_prices and current_price:
            prev_price = prev_prices[addr]
            if prev_price:
                price_change = current_price - prev_price
                price_change_pct = (price_change / prev_price) * 100
                prop["price_change"] = price_change
                prop["price_change_pct"] = round(price_change_pct, 2)

    return current_properties
def format_for_database(properties: list[dict]) -> str:
    """
    Format properties as JSONL for a database load.

    Args:
        properties: List of property records

    Returns:
        JSONL string
    """
    return "\n".join(json.dumps(prop) for prop in properties)
# Example: main scraping pipeline
if __name__ == "__main__":
    print("Real Estate Data Scraping Pipeline")
    print("-" * 50)

    # Example search URL (replace with your target; use a listing site
    # whose ToS permits crawling)
    search_url = "https://www.redfin.com/homes-for-sale/94102"

    # Step 1: Crawl listings
    properties = crawl_real_estate_site(
        search_url=search_url,
        max_pages=3,
    )
    print(f"\nExtracted {len(properties)} listings")

    # Step 2: Deduplicate by address
    unique_properties = deduplicate_properties(properties)
    print(f"After dedup: {len(unique_properties)} unique properties")

    # Step 3: Geocode (add lat/long for mapping)
    geocoded_properties = geocode_properties(unique_properties)

    # Step 4: Format for the database
    jsonl_output = format_for_database(geocoded_properties)

    # Save to file
    with open("properties.jsonl", "w") as f:
        f.write(jsonl_output)
    print("Saved to properties.jsonl")

    # Show the first result
    if geocoded_properties:
        print("\nExample property:")
        print(json.dumps(geocoded_properties[0], indent=2))
```
Step 3: Run the pipeline
```bash
export FASTCRW_API_KEY="your_api_key_here"
uv run python scrape_real_estate.py
```
This creates properties.jsonl with deduplicated, geocoded properties ready to load into your real estate database.
Production Considerations
Handling Dynamic Real Estate Sites
Modern real estate sites (Zillow, Redfin) load listings via JavaScript. fastCRW's Chrome rendering (Business plan) handles this automatically: request formats: ["markdown"] and fastCRW renders the page before extracting.
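Concretely, the same /crawl call from the walkthrough covers JavaScript-heavy pages; no extra flags are needed beyond the markdown format (this assumes your plan includes rendering, and the search URL is a placeholder):

```python
import os

import requests

# Same endpoint as the walkthrough; rendering happens server-side
payload = {
    "url": "https://example.com/homes-for-sale?page=1",  # placeholder URL
    "formats": ["markdown"],
}
response = requests.post(
    "https://api.fastcrw.com/v1/crawl",
    headers={"Authorization": f"Bearer {os.getenv('FASTCRW_API_KEY')}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json().get("data", []))
```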
Address Normalization
Properties are often listed with slight address variations:
- "123 Main Street" vs "123 Main St"
- "San Francisco, CA" vs "San Francisco, California"
- Unit numbers: "456 Oak Ave #200" vs "456 Oak Ave Unit 200"
Use fuzzy matching (SequenceMatcher at 90%+ similarity) to catch duplicates despite minor variations.
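To get a feel for the threshold, compare two normalized variants directly; this pair scores roughly 0.86, under the 0.90 cutoff, which is why unit-number styles should also be normalized before matching:

```python
from difflib import SequenceMatcher

# Two renderings of the same unit that survive basic normalization
a = "456 OAK AVENUE #200"
b = "456 OAK AVENUE UNIT 200"
print(round(SequenceMatcher(None, a, b).ratio(), 3))
```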
Price Change Tracking
Run your scraper daily and store results with timestamps. Calculate price deltas to detect:
- Price reductions (common before market shifts)
- Price increases (popular neighborhoods)
- Rapid fluctuations (listing errors or market anomalies)
This is valuable data for investors and agents.
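A daily job can diff today's scrape against yesterday's snapshot using detect_price_changes from the walkthrough. A sketch, assuming the script is saved as scrape_real_estate.py and snapshots use the file names shown:

```python
import json

# Assumes the walkthrough script lives in scrape_real_estate.py
from scrape_real_estate import crawl_real_estate_site, detect_price_changes

# Load yesterday's snapshot (one JSON record per line)
with open("properties-yesterday.jsonl") as f:
    previous = [json.loads(line) for line in f if line.strip()]

# Scrape today and annotate any price movement
current = crawl_real_estate_site("https://example.com/search?zip=94102")
for prop in detect_price_changes(current, previous):
    if prop.get("price_change"):
        print(prop["address"], prop["price_change"], f'{prop["price_change_pct"]}%')
```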
Rate Limiting and Proxy Rotation
Real estate sites monitor for scrapers. Protect yourself (a sketch of the first two items follows the list):
- Use delays between requests (2+ seconds recommended).
- Rotate User-Agent headers to vary browser signatures.
- Use fastCRW's residential proxy option (Business plan) for large-scale scrapes.
- Scrape during off-peak hours (late night, early morning).
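A minimal sketch of jittered delays plus a rotating User-Agent pool (the header strings are illustrative, not a recommendation for a specific browser signature):

```python
import random
import time

import requests

# Illustrative User-Agent pool; rotate one per request
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_get(url: str) -> requests.Response:
    # Jittered delay: 2s baseline plus up to 1.5s of noise
    time.sleep(2 + random.uniform(0, 1.5))
    return requests.get(
        url,
        headers={"User-Agent": random.choice(USER_AGENTS)},
        timeout=30,
    )
```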
Geocoding Efficiency
Geocoding is slow. Cache results (a sketch follows the list):
- Store address → lat/long mappings in a separate cache table.
- Before geocoding a property, check if you've already geocoded that address.
- Use batch geocoding APIs if your volume is high.
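A file-backed cache sketch wrapping the Nominatim geocoder from the walkthrough (the cache path is an assumption; swap in a database table at production volume):

```python
import json
import os
import time

from geopy.geocoders import Nominatim

CACHE_PATH = "geocode_cache.json"  # assumed location
_cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
_geocoder = Nominatim(user_agent="realestate_scraper")

def geocode_cached(address: str) -> tuple | None:
    """Serve repeat addresses from the cache; hit the API once per address."""
    if address not in _cache:
        location = _geocoder.geocode(address, timeout=5)
        _cache[address] = (
            [location.latitude, location.longitude] if location else None
        )
        with open(CACHE_PATH, "w") as f:
            json.dump(_cache, f)
        time.sleep(0.5)  # Respect the geocoding API's rate limits
    result = _cache[address]
    return tuple(result) if result else None
```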
Legal and Ethical Notes
ToS Compliance by Site
- Zillow: ToS explicitly forbids scraping. High legal risk.
- Redfin: Terms have historically restricted automated access; verify the current ToS before crawling.
- Trulia: Owned by Zillow Group; expect similarly restrictive terms.
- Apartments.com: Check robots.txt and the ToS; a permissive robots.txt is necessary but not sufficient.
- Craigslist: Forbids scraping explicitly; do not scrape.
- Public MLS: Legally restricted; requires a broker license or partnership.
Always check robots.txt and ToS before scraping a new site.
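The robots.txt side of that check can be automated with the standard library; robots.txt is only one signal, and the ToS still governs:

```python
from urllib.robotparser import RobotFileParser

# Placeholder domain; point this at the site you intend to crawl
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("MyScraperBot/1.0", "https://example.com/homes-for-sale/94102"))
```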
Fair Use in Real Estate
Scraped property data can be used for:
- Personal investment analysis: Screening properties for purchase.
- Market research: Analyzing trends, affordability, neighborhood stats.
- Competitive intelligence: Comparing your listings to market comps.
Prohibited uses:
- Republishing: Copying Zillow listings onto your own site violates ToS.
- Commercial redistribution: Selling property data without licensing.
- Competitor impersonation: Presenting others' listings as your own.
FAQ
Q: Can I use scraped data to build a real estate portal like Zillow?
A: No. Zillow's ToS forbids scraping, and republishing their listings violates copyright. You can build a portal by:
- Licensing data from MLS (requires broker partnership)
- Licensing from real estate data providers
- Aggregating from open/permissive sources
- Partnering with individual agents who own their listings
Q: How do I integrate with MLS?
A: MLS data is controlled by regional real estate boards and requires a broker license or partnership agreement. Contact your local real estate board for MLS API access, or partner with an MLS data provider like CoreLogic or Black Knight.
Q: Can I use this for appraisals?
A: Scraped data is useful for market research but not official appraisals. Appraisals require licensed appraisers using official comps data. Use scraped data for preliminary analysis, but hire a professional for formal appraisals.
Q: How do I handle rental vs. for-sale listings differently?
A: Both use the same extraction schema, but price semantics differ. For rentals, price is monthly; for sales, it's total purchase price. Add a listing_type field ("For Sale" or "For Rent") to disambiguate in your pipeline.
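One way to wire that in is to extend the walkthrough's schema before crawling (the enum values here are illustrative; this assumes the schema from scrape_real_estate.py):

```python
from scrape_real_estate import PROPERTY_EXTRACTION_SCHEMA

# Extend the extraction schema with a listing-type discriminator
PROPERTY_EXTRACTION_SCHEMA["properties"]["listing_type"] = {
    "type": "string",
    "enum": ["For Sale", "For Rent"],
    "description": "Whether the listing is a sale or a rental",
}
```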
Q: What about private listing networks?
A: Pocket listings and other private networks are not public; scraping them typically breaches access agreements and may expose you to legal liability. Stick to public listing sites and properly licensed MLS feeds.
Q: Can I detect fraud (e.g., flipped listings)?
A: Yes. Track address + listing history. If the same property relists within weeks with a much higher price, it may indicate wholesaling or fraud. Flag these for manual review.
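A relist check sketch over stored history, assuming one record per scrape with address, price, and scraped_at fields (as the walkthrough produces):

```python
from collections import defaultdict

def flag_suspicious_relists(history: list[dict], jump_pct: float = 20.0) -> list[str]:
    """Flag addresses that reappear with a sharply higher price."""
    by_address = defaultdict(list)
    for record in sorted(history, key=lambda r: r.get("scraped_at", "")):
        if record.get("address") and record.get("price"):
            by_address[record["address"]].append(record["price"])

    flagged = []
    for address, prices in by_address.items():
        # Compare each price to the one recorded before it
        for earlier, later in zip(prices, prices[1:]):
            if later > earlier * (1 + jump_pct / 100):
                flagged.append(address)
                break
    return flagged
```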
Related resources
- Firecrawl alternatives — direct comparison for property-listing extraction pipelines
- Scrapfly alternatives — proxy-rotation option for listing sites with strong anti-bot
- LangChain integration — feed property pages into a retrieval/Q&A pipeline
- n8n integration — schedule MLS-adjacent crawls without writing infra
- Market research — broader pattern for territory and macro analysis
- Price monitoring — the price-tracking subset, applied to housing