
Web Scraping for Job Board Data

Use fastCRW to scrape job listings from public boards and build recruiting pipelines with structured data extraction.

Published: May 12, 2026
Updated: May 12, 2026
Category: Use cases

  • Extract job title, company, location, salary, and job description from public listings
  • Crawl job boards at scale with built-in rate limiting and ToS respect
  • Deduplicate listings and track salary trends across sources

Verdict

fastCRW is the tool for building recruiting data pipelines from public job boards. While LinkedIn is off-limits by ToS, legal public sources like Indeed, Glassdoor, and ZipRecruiter expose rich structured data that fastCRW extracts into hiring-ready JSON. You get fresh listings daily, structured fields out of the box, and a flexible pipeline that costs far less than per-record job APIs. The tradeoff: you must respect ToS and rate limits, and parsing free data is no substitute for official job board APIs where they exist.


Why Job Boards Need Web Scraping

Job APIs from Indeed, LinkedIn, and Glassdoor exist, but they're expensive, rate-limited, or restricted to premium partners. Scraping public job listings directly lets you:

  • build a single unified feed across multiple job boards,
  • refresh listings every few hours without per-request costs,
  • extract salary trends and market intelligence,
  • power a custom recruiting dashboard or job alert system,
  • and unlock wage transparency and equity analysis.

Most job boards publish listings in accessible HTML. fastCRW turns that HTML into clean, structured JSON—job title, company, salary, location, requirements—ready for hiring workflows.
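Concretely, one extracted listing might come back as a record like this (the values are illustrative; the field names mirror the extraction schema defined in the walkthrough below):

```python
# An illustrative extracted record — all values are made up
job = {
    "job_title": "Senior Software Engineer",
    "company_name": "Acme Corp",
    "location": "San Francisco, CA",
    "salary_min": 150000,
    "salary_max": 190000,
    "job_type": "Full-time",
    "description": "Build and operate distributed backend services...",
    "requirements": ["5+ years Python", "Distributed systems experience"],
}
```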


Where fastCRW Helps

  • Multi-board aggregation: crawl multiple job boards and merge results into a single feed
  • Salary extraction: LLM extraction parses "$80K–$120K" into structured min/max fields
  • Pagination handling: crawl with maxDepth: 2 follows pagination and loads all listings
  • Duplicate detection: extract by URL + company + title, then deduplicate in your pipeline
  • Recurring updates: schedule daily crawls to keep job data fresh

Architecture Overview

A typical job board scraping pipeline has four stages:

  1. Discovery: Map the job board to find listing URLs and pagination structure.
  2. Crawl: Fetch all listings across pages, respecting rate limits.
  3. Extraction: Parse structured fields (title, company, salary, location, description).
  4. Load: Deduplicate and insert into your recruiting database.

fastCRW handles discovery and crawl; your code handles extraction schema definition and deduplication.


Implementation Walkthrough

Here's a complete Python example that scrapes Indeed job listings for a role and location, extracts structured data, and deduplicates results.

Step 1: Install dependencies

uv venv
uv pip install requests python-dotenv

Step 2: Define your extraction schema

import json
import os
import time
from datetime import datetime

import requests
from dotenv import load_dotenv

# Load the API key from a .env file or the environment
load_dotenv()
FASTCRW_API_KEY = os.getenv("FASTCRW_API_KEY")
FASTCRW_BASE_URL = "https://api.fastcrw.com/v1"

# Define the extraction schema for job listings
JOB_EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "job_title": {
            "type": "string",
            "description": "The job title or position name"
        },
        "company_name": {
            "type": "string",
            "description": "The company or organization name"
        },
        "location": {
            "type": "string",
            "description": "Job location or remote status"
        },
        "salary_min": {
            "type": "number",
            "description": "Minimum salary in USD, or null if not specified"
        },
        "salary_max": {
            "type": "number",
            "description": "Maximum salary in USD, or null if not specified"
        },
        "job_type": {
            "type": "string",
            "enum": ["Full-time", "Part-time", "Contract", "Temporary", "Unknown"],
            "description": "Employment type"
        },
        "description": {
            "type": "string",
            "description": "Job description summary (first 500 chars)"
        },
        "requirements": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Key requirements or qualifications"
        }
    },
    "required": ["job_title", "company_name", "location"]
}

def crawl_job_board(query: str, location: str, max_pages: int = 5) -> list[dict]:
    """
    Crawl Indeed job listings for a given role and location.
    
    Args:
        query: Job title or keywords (e.g., "software engineer")
        location: City or remote (e.g., "San Francisco" or "Remote")
        max_pages: Max pagination pages to crawl (default 5 = ~100 results)
    
    Returns:
        List of crawl URLs to process
    """
    # Build the Indeed search URL with URL-encoded query parameters
    from urllib.parse import urlencode
    base_url = "https://www.indeed.com/jobs"
    search_url = f"{base_url}?{urlencode({'q': query, 'l': location})}"
    
    print(f"Mapping Indeed search: {search_url}")
    
    # Step 1: Map the job board to find listing pages
    map_payload = {
        "url": search_url,
        "maxDepth": 2,  # Follow pagination
    }
    
    map_response = requests.post(
        f"{FASTCRW_BASE_URL}/map",
        headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
        json=map_payload
    )
    map_response.raise_for_status()
    urls_to_crawl = map_response.json().get("urls", [])
    
    print(f"Found {len(urls_to_crawl)} listing pages")
    return urls_to_crawl[:max_pages]

def extract_jobs(url: str) -> list[dict]:
    """
    Crawl a job listing page and extract structured job data.
    
    Args:
        url: URL of the job listing page
    
    Returns:
        List of extracted job records
    """
    print(f"Crawling {url}")
    
    # Step 2: Crawl the listing page
    crawl_payload = {
        "url": url,
        "formats": ["markdown"],
        "extraction": {
            "schema": JOB_EXTRACTION_SCHEMA
        }
    }
    
    response = requests.post(
        f"{FASTCRW_BASE_URL}/crawl",
        headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
        json=crawl_payload,
        timeout=60
    )
    response.raise_for_status()
    
    crawl_result = response.json()
    
    # Extract jobs from the crawl result
    jobs = []
    if "data" in crawl_result:
        for item in crawl_result["data"]:
            if "extractedData" in item:
                job = item["extractedData"]
                job["source_url"] = item.get("url")
                job["scraped_at"] = datetime.utcnow().isoformat()
                jobs.append(job)
    
    return jobs

def deduplicate_jobs(jobs: list[dict]) -> list[dict]:
    """
    Deduplicate job listings by title, company, and location.
    
    Args:
        jobs: List of job records
    
    Returns:
        Deduplicated list (keeps first occurrence)
    """
    seen = set()
    unique_jobs = []
    
    for job in jobs:
        # Create a dedup key from title, company, location
        key = (
            job.get("job_title", "").lower().strip(),
            job.get("company_name", "").lower().strip(),
            job.get("location", "").lower().strip()
        )
        
        if key not in seen:
            seen.add(key)
            unique_jobs.append(job)
    
    return unique_jobs

def format_for_database(jobs: list[dict]) -> str:
    """
    Format extracted jobs as JSONL (one JSON object per line).
    Ready to load into a database or data warehouse.
    
    Args:
        jobs: List of job records
    
    Returns:
        JSONL string
    """
    return "\n".join(json.dumps(job) for job in jobs)

# Example: Main scraping pipeline
if __name__ == "__main__":
    print("Job Board Scraping Pipeline")
    print("-" * 50)
    
    # Step 1: Crawl the job board
    search_urls = crawl_job_board(
        query="software engineer",
        location="San Francisco, CA",
        max_pages=3
    )
    
    # Step 2: Extract jobs from each page
    all_jobs = []
    for url in search_urls:
        try:
            jobs = extract_jobs(url)
            all_jobs.extend(jobs)
            time.sleep(1)  # Respect rate limits
        except Exception as e:
            print(f"Error crawling {url}: {e}")
    
    print(f"\nExtracted {len(all_jobs)} job listings")
    
    # Step 3: Deduplicate
    unique_jobs = deduplicate_jobs(all_jobs)
    print(f"After dedup: {len(unique_jobs)} unique listings")
    
    # Step 4: Format for database
    jsonl_output = format_for_database(unique_jobs)
    
    # Save to file
    with open("jobs.jsonl", "w") as f:
        f.write(jsonl_output)
    
    print("Saved to jobs.jsonl")
    
    # Show first result
    if unique_jobs:
        print("\nExample listing:")
        print(json.dumps(unique_jobs[0], indent=2))

Step 3: Run the scraper

export FASTCRW_API_KEY="your_api_key_here"
uv run python scrape_jobs.py

This creates jobs.jsonl with deduplicated listings ready to load into your recruiting database.
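Because JSONL stores one record per line, loading it back (say, before inserting into a database) is a one-liner per record:

```python
import json

def load_jobs(path: str) -> list[dict]:
    """Read a JSONL file of job records back into a list of dicts,
    skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```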


Production Considerations

Rate Limiting and Politeness

Job boards can block aggressive scrapers. fastCRW includes built-in rate limiting, but respect these practices:

  1. Add delays between requests: Use time.sleep(1) or higher between page crawls.
  2. Set a descriptive User-Agent: Let the site know who you are.
  3. Crawl off-peak hours: Early morning or late evening puts less load on their servers.
  4. Rotate IP addresses: For large-scale scrapes, use a proxy service (fastCRW supports residential proxies via Business plan).
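The delay practice works better with random jitter, so requests don't land on a perfectly fixed cadence. A small sketch (tune the numbers to the target site):

```python
import random
import time

def polite_sleep(base: float = 1.0, jitter: float = 0.5) -> float:
    """Sleep for `base` seconds plus up to `jitter` seconds of random noise.
    Returns the delay actually used, which is handy for logging crawl pacing."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```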

Handling JavaScript-Rendered Listings

Some job boards (e.g., newer Glassdoor) load listings via JavaScript. fastCRW's LightPanda rendering (Pro plan) or Chrome rendering (Business plan) handles this automatically. Just request formats: ["markdown"] and let fastCRW render the page.

Deduplication Strategy

Job postings get reposted across multiple boards. Deduplicate by:

  1. Exact match: title + company + location
  2. Fuzzy match: if titles vary slightly, use a similarity ratio (Python's standard difflib) or Levenshtein distance (a third-party library)
  3. URL dedup: some boards repost the same job under different URLs

The simple exact-match approach works for most cases.
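When you do need the fuzzy case, the standard library's difflib gives a similarity ratio without extra dependencies. A sketch (the 0.85 threshold is an arbitrary starting point to tune):

```python
from difflib import SequenceMatcher

def titles_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two job titles as duplicates when their similarity ratio
    (Ratcliff/Obershelp, via difflib) meets the threshold."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold
```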

Updating Old Data

Run your scraper daily to keep job data fresh. Most job boards expire listings after 30–60 days. Track scraped_at timestamps and remove stale entries monthly.
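A monthly cleanup pass can lean on the scraped_at field written during extraction. A sketch, assuming ISO-8601 timestamps:

```python
from datetime import datetime, timedelta, timezone

def prune_stale(jobs: list[dict], max_age_days: int = 60) -> list[dict]:
    """Drop records whose scraped_at timestamp is older than max_age_days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    fresh = []
    for job in jobs:
        scraped = datetime.fromisoformat(job["scraped_at"])
        if scraped.tzinfo is None:  # naive timestamps (e.g. from utcnow())
            scraped = scraped.replace(tzinfo=timezone.utc)
        if scraped >= cutoff:
            fresh.append(job)
    return fresh
```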


Respect Terms of Service

  • LinkedIn: Explicitly forbids scraping job listings. Do not scrape.
  • Indeed: Verify the current ToS before scraping; rate limiting is actively enforced, so use reasonable delays.
  • Glassdoor: A gray area; check the current ToS. Use with caution.
  • ZipRecruiter: Public listings are generally accessible, but confirm the current ToS.
  • Custom careers pages: Usually allowed unless the site's robots.txt or ToS forbids it.

Always check a site's robots.txt and ToS before scraping.

Rate Limiting

Aggressive scraping can trigger IP bans. fastCRW manages request throttling, but pair it with:

  1. Identifying your crawler (User-Agent header)
  2. Respecting Retry-After headers if the site returns 429
  3. Crawling during off-peak hours
  4. Using residential proxies for large-scale projects
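Honoring Retry-After amounts to a few lines: use the header's value when it's numeric, otherwise fall back to exponential backoff. A sketch (Retry-After can also be an HTTP date, which this fallback deliberately ignores):

```python
def retry_after_seconds(headers: dict, attempt: int) -> float:
    """Seconds to wait before retrying a 429 response: honor a numeric
    Retry-After header, else use exponential backoff (1s, 2s, 4s, ...)."""
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return float(value)
        except ValueError:
            pass  # HTTP-date form; fall through to backoff
    return float(2 ** attempt)
```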

Wage Transparency and Ethics

Job salary data enables wage equity analysis and transparency. Use it responsibly:

  • Aggregate salary by role, location, and experience level.
  • Avoid reverse-identifying individuals from job posting metadata.
  • Focus on aggregate trends (median, percentiles) not individual records.
  • Disclose data sources when publishing salary surveys.
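Aggregating to medians and percentiles takes only the standard library's statistics module. A sketch over salary-range midpoints, assuming the salary_min/salary_max fields from the extraction schema:

```python
from statistics import median, quantiles

def salary_summary(jobs: list[dict]) -> dict:
    """Aggregate salary-range midpoints into count, median, and quartiles.
    Records without both salary bounds are skipped."""
    midpoints = [
        (job["salary_min"] + job["salary_max"]) / 2
        for job in jobs
        if job.get("salary_min") is not None and job.get("salary_max") is not None
    ]
    if len(midpoints) < 2:
        return {}  # too few data points to aggregate meaningfully
    q1, _, q3 = quantiles(midpoints, n=4)
    return {
        "count": len(midpoints),
        "median": median(midpoints),
        "p25": q1,
        "p75": q3,
    }
```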

FAQ

Q: How is fastCRW different from a job API like LinkedIn Recruiter or Indeed API?

A: APIs are official but expensive (per-record or per-month licensing) and often restricted to premium partners. Scraping public listings is free but requires respecting ToS and rate limits. fastCRW is ideal when you want:

  • Low cost for high volume
  • Data from multiple boards unified
  • Flexibility to run on your own schedule
  • Custom extraction fields beyond the API schema

Q: Can I commercialize job data I scrape?

A: Generally no. Most job boards' ToS forbid commercial reuse without licensing. You can scrape for internal recruiting use, but selling a "salary database" built from scraped Indeed listings violates their ToS. Check with a lawyer if you're unsure.

Q: How do I handle listings that require login?

A: Most public job boards don't require login to view listings. If one does, scraping the login-protected content likely violates ToS. Stick to public boards and use official APIs for protected data.

Q: What's the difference between scraping job boards and scraping careers pages?

A: Public job boards (Indeed, Glassdoor) have many listings and are crawled by bots routinely. Company careers pages are less monitored but less data-rich. Both are valid targets if ToS permits. fastCRW can handle both with the same extraction schema.

Q: How do I detect and remove old postings?

A: Track the scraped_at timestamp and URL. If the same URL stops appearing in your crawls, the listing has likely expired; after two or more weeks of absence, remove it, or confirm the URL returns a 404 before deleting.

Q: Can fastCRW handle Indeed's dynamic pagination?

A: Yes. Use crawl with maxDepth: 2 and request markdown output. fastCRW's rendering will load pagination and extract listings from all visible pages.
