Web Scraping for Real Estate Data
Use fastCRW to build property listing pipelines from real estate sites with structured extraction of price, location, and features.
fastCRW powers property data pipelines by extracting listings, prices, and location features from public real estate sites, turning accessible HTML into structured JSON with geocoding integration. The legal boundary is not uniform: major portals like Zillow forbid scraping outright, so check each site's ToS and robots.txt before crawling. You get fresh market data daily, deduplication by address, and a pipeline that beats manual scraping or expensive data aggregators.
Verdict
fastCRW builds real estate data pipelines from public property listing sites. Zillow's ToS forbids scraping, but listing sites with more permissive terms and public real estate aggregators expose rich property data that fastCRW extracts into structured JSON: price, address, beds/baths, square footage, HOA fees, days-on-market. You get a unified property feed across multiple sources, price tracking over time, and a pipeline that beats manual collection or paying thousands for enterprise data feeds. The tradeoff: respect ToS strictly, understand the legal risks of your chosen data source, and verify compliance with a lawyer before commercializing.
Why Real Estate Needs Web Scraping
Property data is fragmented across dozens of sites: Zillow, Redfin, Apartments.com, Craigslist, and local MLS systems. Scraping public listings lets you:
- aggregate listings from multiple sources into a single search,
- track price changes and days-on-market over time,
- analyze neighborhood market trends and affordability,
- power listing alerts and investment screening,
- and build a PropTech product without expensive MLS licensing.
Most property sites embed structured data in their pages (microdata, JSON-LD, or consistent plain HTML) that fastCRW extracts into clean JSON ready for your database.
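Many listing pages carry their structured data as a schema.org JSON-LD block, which you can inspect before committing to a full crawl. A minimal sketch, assuming the page embeds such a block (the URL is a placeholder):

```python
import json
import re

import requests

# Hypothetical listing URL; replace with a page you are permitted to fetch
html = requests.get("https://example.com/listing/123", timeout=30).text

# Find any <script type="application/ld+json"> blocks in the page
for block in re.findall(
    r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL
):
    try:
        data = json.loads(block)
        print(data.get("@type"), data.get("name"))
    except json.JSONDecodeError:
        continue  # Skip blocks that are not valid JSON
```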
Where fastCRW Helps
| Need | fastCRW Role |
|---|---|
| Multi-source aggregation | crawl multiple real estate sites and merge into unified feed |
| Price extraction | LLM extraction parses "$450,000" and "$2,500/mo" into numeric fields |
| Feature parsing | Extract beds, baths, sqft, lot size, year built, etc. from descriptions |
| Pagination | crawl with maxDepth: 2 loads all listings across search results |
| Price history | Run daily crawls and track price changes by property address |
Architecture Overview
A production real estate pipeline has five stages:
- Discovery: Map the site to find listing URLs and search filters.
- Crawl: Fetch all listings across pages, respecting rate limits.
- Extraction: Parse structured fields (price, address, beds, baths, etc.).
- Normalization: Clean addresses, deduplicate by location, geocode.
- Load: Insert into your property database with timestamp tracking.
fastCRW handles discovery and crawl; your code handles extraction schema, deduplication, and geocoding.
Implementation Walkthrough
Here's a complete Python example that scrapes a real estate listing site, extracts properties, deduplicates by address, and geocodes for map visualization.
Step 1: Install dependencies
```bash
uv venv
uv pip install requests geopy python-dotenv
```
Step 2: Define the extraction schema and pipeline
```python
import json
import os
import re
import time
from datetime import datetime, timezone
from difflib import SequenceMatcher

import requests
from dotenv import load_dotenv

# Load API keys from a .env file if present
load_dotenv()
FASTCRW_API_KEY = os.getenv("FASTCRW_API_KEY")
FASTCRW_BASE_URL = "https://api.fastcrw.com/v1"
# Define the extraction schema for property listings
PROPERTY_EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "address": {
            "type": "string",
            "description": "Full street address (street, city, state, ZIP)"
        },
        "price": {
            "type": "number",
            "description": "Price in USD (for sale) or monthly rent (for rental)"
        },
        "bedrooms": {
            "type": "integer",
            "description": "Number of bedrooms"
        },
        "bathrooms": {
            "type": "number",
            "description": "Number of bathrooms (may be decimal like 2.5)"
        },
        "square_feet": {
            "type": "integer",
            "description": "Square footage of the property"
        },
        "property_type": {
            "type": "string",
            "enum": ["House", "Condo", "Townhouse", "Apartment", "Land", "Other"],
            "description": "Type of property"
        },
        "lot_size": {
            "type": "string",
            "description": "Lot size (e.g., '0.25 acres' or '10,000 sqft')"
        },
        "year_built": {
            "type": "integer",
            "description": "Year the property was built"
        },
        "hoa_fee": {
            "type": "number",
            "description": "Monthly HOA fee if applicable"
        },
        "days_on_market": {
            "type": "integer",
            "description": "Days the property has been listed"
        },
        "listing_url": {
            "type": "string",
            "description": "URL of the listing detail page"
        }
    },
    "required": ["address", "price", "bedrooms", "bathrooms"]
}
def crawl_real_estate_site(
    search_url: str,
    max_pages: int = 5,
) -> list[dict]:
    """
    Crawl a real estate site to discover and extract listing pages.

    Args:
        search_url: Search-results URL from a listing site whose ToS permits crawling
        max_pages: Maximum number of discovered listing pages to crawl

    Returns:
        List of properties with extracted data
    """
    print(f"Mapping real estate site: {search_url}")

    # Step 1: Map the site to find listing structure
    map_payload = {
        "url": search_url,
        "maxDepth": 2,  # Follow pagination
    }
    map_response = requests.post(
        f"{FASTCRW_BASE_URL}/map",
        headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
        json=map_payload,
    )
    map_response.raise_for_status()
    discovered_urls = map_response.json().get("urls", [])
    print(f"Found {len(discovered_urls)} listing pages")

    # Step 2: Crawl listings and extract data
    all_properties = []
    for url in discovered_urls[:max_pages]:
        try:
            print(f"Crawling {url}")
            crawl_payload = {
                "url": url,
                "formats": ["markdown"],
                "extraction": {
                    "schema": PROPERTY_EXTRACTION_SCHEMA
                },
            }
            response = requests.post(
                f"{FASTCRW_BASE_URL}/crawl",
                headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
                json=crawl_payload,
                timeout=60,
            )
            response.raise_for_status()
            crawl_result = response.json()

            # Extract properties from the crawl result
            if "data" in crawl_result:
                for item in crawl_result["data"]:
                    if "extractedData" in item:
                        prop = item["extractedData"]
                        prop["source_url"] = item.get("url")
                        prop["scraped_at"] = datetime.now(timezone.utc).isoformat()
                        all_properties.append(prop)

            time.sleep(2)  # Respect rate limits
        except Exception as e:
            print(f"Error crawling {url}: {e}")

    return all_properties
def normalize_address(address: str) -> str:
    """
    Normalize an address for deduplication.

    Args:
        address: Raw address string

    Returns:
        Normalized address
    """
    if not address:
        return ""

    # Convert to uppercase and strip whitespace
    normalized = address.strip().upper()

    # Expand common street-type abbreviations (with or without a trailing
    # period), matching whole words only. Note: this is deliberately
    # simplistic; two-letter state codes like CT can collide with street types.
    abbreviations = {
        "ST": "STREET",
        "AVE": "AVENUE",
        "BLVD": "BOULEVARD",
        "RD": "ROAD",
        "CT": "COURT",
        "LN": "LANE",
        "DR": "DRIVE",
        "PL": "PLACE",
    }
    for abbr, full in abbreviations.items():
        normalized = re.sub(rf"\b{abbr}\.?(?=\s|$)", full, normalized)

    return normalized
def deduplicate_properties(properties: list[dict]) -> list[dict]:
    """
    Deduplicate property listings by address with fuzzy matching.

    Args:
        properties: List of property records

    Returns:
        Deduplicated list (keeps first occurrence)
    """
    seen = {}
    unique_properties = []

    for prop in properties:
        norm_address = normalize_address(prop.get("address", ""))

        # Check for an exact match
        if norm_address in seen:
            continue

        # Check for a fuzzy match (90%+ similarity)
        found_match = False
        for seen_address in seen.keys():
            similarity = SequenceMatcher(None, norm_address, seen_address).ratio()
            if similarity >= 0.90:
                found_match = True
                break

        if not found_match:
            seen[norm_address] = True
            unique_properties.append(prop)

    return unique_properties
def geocode_properties(properties: list[dict]) -> list[dict]:
    """
    Add latitude/longitude to properties via geocoding.

    Args:
        properties: List of property records

    Returns:
        Properties with latitude and longitude added
    """
    try:
        from geopy.geocoders import Nominatim

        geocoder = Nominatim(user_agent="realestate_scraper")
        for prop in properties:
            address = prop.get("address")
            if not address:
                continue
            try:
                location = geocoder.geocode(address, timeout=5)
                if location:
                    prop["latitude"] = location.latitude
                    prop["longitude"] = location.longitude
                else:
                    print(f"Geocoding failed for: {address}")
                    prop["latitude"] = None
                    prop["longitude"] = None
                time.sleep(0.5)  # Rate limit the geocoding API
            except Exception as e:
                print(f"Geocoding error for {address}: {e}")
                prop["latitude"] = None
                prop["longitude"] = None
        return properties
    except ImportError:
        print("geopy not installed. Skipping geocoding.")
        return properties
def detect_price_changes(
    current_properties: list[dict],
    previous_properties: list[dict],
) -> list[dict]:
    """
    Detect price changes between two scrape runs.

    Args:
        current_properties: Latest properties
        previous_properties: Previous scrape results

    Returns:
        Properties with price_change and price_change_pct fields
    """
    # Build a lookup of previous prices by normalized address
    prev_prices = {}
    for prop in previous_properties:
        addr = normalize_address(prop.get("address", ""))
        if addr:
            prev_prices[addr] = prop.get("price")

    # Calculate deltas
    for prop in current_properties:
        addr = normalize_address(prop.get("address", ""))
        current_price = prop.get("price")
        if addr in prev_prices and current_price:
            prev_price = prev_prices[addr]
            if prev_price:
                price_change = current_price - prev_price
                price_change_pct = (price_change / prev_price) * 100
                prop["price_change"] = price_change
                prop["price_change_pct"] = round(price_change_pct, 2)

    return current_properties
def format_for_database(properties: list[dict]) -> str:
    """
    Format properties as JSONL for a database load.

    Args:
        properties: List of property records

    Returns:
        JSONL string
    """
    return "\n".join(json.dumps(prop) for prop in properties)
# Example: main scraping pipeline
if __name__ == "__main__":
    print("Real Estate Data Scraping Pipeline")
    print("-" * 50)

    # Example search URL (replace with your target; use a listing site
    # whose ToS permits crawling)
    search_url = "https://www.redfin.com/homes-for-sale/94102"

    # Step 1: Crawl listings
    properties = crawl_real_estate_site(
        search_url=search_url,
        max_pages=3,
    )
    print(f"\nExtracted {len(properties)} listings")

    # Step 2: Deduplicate by address
    unique_properties = deduplicate_properties(properties)
    print(f"After dedup: {len(unique_properties)} unique properties")

    # Step 3: Geocode (add lat/long for mapping)
    geocoded_properties = geocode_properties(unique_properties)

    # Step 4: Format for the database
    jsonl_output = format_for_database(geocoded_properties)

    # Save to file
    with open("properties.jsonl", "w") as f:
        f.write(jsonl_output)
    print("Saved to properties.jsonl")

    # Show the first result
    if geocoded_properties:
        print("\nExample property:")
        print(json.dumps(geocoded_properties[0], indent=2))
```
Step 3: Run the pipeline
```bash
export FASTCRW_API_KEY="your_api_key_here"
uv run python scrape_real_estate.py
```
This creates properties.jsonl with deduplicated, geocoded properties ready to load into your real estate database.
Production Considerations
Handling Dynamic Real Estate Sites
Modern real estate sites (Zillow, Redfin) load listings via JavaScript. fastCRW's Chrome rendering (Business plan) handles this automatically: request formats: ["markdown"] and fastCRW renders the page before extracting.
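Concretely, the same /crawl call from the walkthrough covers JavaScript-heavy pages; no extra flags are needed beyond the markdown format (this assumes your plan includes rendering, and the search URL is a placeholder):

```python
import os

import requests

# Same endpoint as the walkthrough; rendering happens server-side
payload = {
    "url": "https://example.com/homes-for-sale?page=1",  # placeholder URL
    "formats": ["markdown"],
}
response = requests.post(
    "https://api.fastcrw.com/v1/crawl",
    headers={"Authorization": f"Bearer {os.getenv('FASTCRW_API_KEY')}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json().get("data", []))
```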
Address Normalization
Properties are often listed with slight address variations:
- "123 Main Street" vs "123 Main St"
- "San Francisco, CA" vs "San Francisco, California"
- Unit numbers: "456 Oak Ave #200" vs "456 Oak Ave Unit 200"
Use fuzzy matching (SequenceMatcher at 90%+ similarity) to catch duplicates despite minor variations.
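To get a feel for the threshold, compare two normalized variants directly; this pair scores roughly 0.86, under the 0.90 cutoff, which is why unit-number styles should also be normalized before matching:

```python
from difflib import SequenceMatcher

# Two renderings of the same unit that survive basic normalization
a = "456 OAK AVENUE #200"
b = "456 OAK AVENUE UNIT 200"
print(round(SequenceMatcher(None, a, b).ratio(), 3))
```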
Price Change Tracking
Run your scraper daily and store results with timestamps. Calculate price deltas to detect:
- Price reductions (common before market shifts)
- Price increases (popular neighborhoods)
- Rapid fluctuations (listing errors or market anomalies)
This is valuable data for investors and agents.
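A daily job can diff today's scrape against yesterday's snapshot using detect_price_changes from the walkthrough. A sketch, assuming the script is saved as scrape_real_estate.py and snapshots use the file names shown:

```python
import json

# Assumes the walkthrough script lives in scrape_real_estate.py
from scrape_real_estate import crawl_real_estate_site, detect_price_changes

# Load yesterday's snapshot (one JSON record per line)
with open("properties-yesterday.jsonl") as f:
    previous = [json.loads(line) for line in f if line.strip()]

# Scrape today and annotate any price movement
current = crawl_real_estate_site("https://example.com/search?zip=94102")
for prop in detect_price_changes(current, previous):
    if prop.get("price_change"):
        print(prop["address"], prop["price_change"], f'{prop["price_change_pct"]}%')
```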
Rate Limiting and Proxy Rotation
Real estate sites monitor for scrapers. Protect yourself (a sketch of the first two items follows the list):
- Use delays between requests (2+ seconds recommended).
- Rotate User-Agent headers to vary browser signatures.
- Use fastCRW's residential proxy option (Business plan) for large-scale scrapes.
- Scrape during off-peak hours (late night, early morning).
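A minimal sketch of jittered delays plus a rotating User-Agent pool (the header strings are illustrative, not a recommendation for a specific browser signature):

```python
import random
import time

import requests

# Illustrative User-Agent pool; rotate one per request
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_get(url: str) -> requests.Response:
    # Jittered delay: 2s baseline plus up to 1.5s of noise
    time.sleep(2 + random.uniform(0, 1.5))
    return requests.get(
        url,
        headers={"User-Agent": random.choice(USER_AGENTS)},
        timeout=30,
    )
```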
Geocoding Efficiency
Geocoding is slow. Cache results (a sketch follows the list):
- Store address → lat/long mappings in a separate cache table.
- Before geocoding a property, check if you've already geocoded that address.
- Use batch geocoding APIs if your volume is high.
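A file-backed cache sketch wrapping the Nominatim geocoder from the walkthrough (the cache path is an assumption; swap in a database table at production volume):

```python
import json
import os
import time

from geopy.geocoders import Nominatim

CACHE_PATH = "geocode_cache.json"  # assumed location
_cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
_geocoder = Nominatim(user_agent="realestate_scraper")

def geocode_cached(address: str) -> tuple | None:
    """Serve repeat addresses from the cache; hit the API once per address."""
    if address not in _cache:
        location = _geocoder.geocode(address, timeout=5)
        _cache[address] = (
            [location.latitude, location.longitude] if location else None
        )
        with open(CACHE_PATH, "w") as f:
            json.dump(_cache, f)
        time.sleep(0.5)  # Respect the geocoding API's rate limits
    result = _cache[address]
    return tuple(result) if result else None
```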
Legal and Ethical Notes
ToS Compliance by Site
- Zillow: ToS explicitly forbids scraping. High legal risk.
- Redfin: Terms have historically restricted automated access; verify the current ToS before crawling.
- Trulia: Owned by Zillow Group; expect similarly restrictive terms.
- Apartments.com: Check robots.txt and the ToS; a permissive robots.txt is necessary but not sufficient.
- Craigslist: Forbids scraping explicitly; do not scrape.
- Public MLS: Legally restricted; requires a broker license or partnership.
Always check robots.txt and ToS before scraping a new site.
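The robots.txt side of that check can be automated with the standard library; robots.txt is only one signal, and the ToS still governs:

```python
from urllib.robotparser import RobotFileParser

# Placeholder domain; point this at the site you intend to crawl
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("MyScraperBot/1.0", "https://example.com/homes-for-sale/94102"))
```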
Fair Use in Real Estate
Scraped property data can be used for:
- Personal investment analysis: Screening properties for purchase.
- Market research: Analyzing trends, affordability, neighborhood stats.
- Competitive intelligence: Comparing your listings to market comps.
Prohibited uses:
- Republishing: Copying Zillow listings onto your own site violates ToS.
- Commercial redistribution: Selling property data without licensing.
- Competitor impersonation: Presenting others' listings as your own.
FAQ
Q: Can I use scraped data to build a real estate portal like Zillow?
A: No. Zillow's ToS forbids scraping, and republishing their listings violates copyright. You can build a portal by:
- Licensing data from MLS (requires broker partnership)
- Licensing from real estate data providers
- Aggregating from open/permissive sources
- Partnering with individual agents who own their listings
Q: How do I integrate with MLS?
A: MLS data is controlled by regional real estate boards and requires a broker license or partnership agreement. Contact your local real estate board for MLS API access, or partner with an MLS data provider like CoreLogic or Black Knight.
Q: Can I use this for appraisals?
A: Scraped data is useful for market research but not official appraisals. Appraisals require licensed appraisers using official comps data. Use scraped data for preliminary analysis, but hire a professional for formal appraisals.
Q: How do I handle rental vs. for-sale listings differently?
A: Both use the same extraction schema, but price semantics differ. For rentals, price is monthly; for sales, it's total purchase price. Add a listing_type field ("For Sale" or "For Rent") to disambiguate in your pipeline.
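One way to wire that in is to extend the walkthrough's schema before crawling (the enum values here are illustrative; this assumes the schema from scrape_real_estate.py):

```python
from scrape_real_estate import PROPERTY_EXTRACTION_SCHEMA

# Extend the extraction schema with a listing-type discriminator
PROPERTY_EXTRACTION_SCHEMA["properties"]["listing_type"] = {
    "type": "string",
    "enum": ["For Sale", "For Rent"],
    "description": "Whether the listing is a sale or a rental",
}
```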
Q: What about private listing networks?
A: Pocket listings and other private networks are not public; scraping them typically breaches access agreements and may expose you to legal liability. Stick to public listing sites and properly licensed MLS feeds.
Q: Can I detect fraud (e.g., flipped listings)?
A: Yes. Track address + listing history. If the same property relists within weeks with a much higher price, it may indicate wholesaling or fraud. Flag these for manual review.
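A relist check sketch over stored history, assuming one record per scrape with address, price, and scraped_at fields (as the walkthrough produces):

```python
from collections import defaultdict

def flag_suspicious_relists(history: list[dict], jump_pct: float = 20.0) -> list[str]:
    """Flag addresses that reappear with a sharply higher price."""
    by_address = defaultdict(list)
    for record in sorted(history, key=lambda r: r.get("scraped_at", "")):
        if record.get("address") and record.get("price"):
            by_address[record["address"]].append(record["price"])

    flagged = []
    for address, prices in by_address.items():
        # Compare each price to the one recorded before it
        for earlier, later in zip(prices, prices[1:]):
            if later > earlier * (1 + jump_pct / 100):
                flagged.append(address)
                break
    return flagged
```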
Related resources
- Firecrawl alternatives — direct comparison for property-listing extraction pipelines
- Scrapfly alternatives — proxy-rotation option for listing sites with strong anti-bot
- LangChain integration — feed property pages into a retrieval/Q&A pipeline
- n8n integration — schedule MLS-adjacent crawls without writing infra
- Market research — broader pattern for territory and macro analysis
- Price monitoring — the price-tracking subset, applied to housing