
Web Scraping for Job Board Data

Use fastCRW to scrape job listings from public boards and build recruiting pipelines with structured data extraction.

Published: May 12, 2026
Updated: May 12, 2026
Category: Use cases

  • Extract job title, company, location, salary, and job description from public listings
  • Crawl job boards at scale with built-in rate limiting and ToS respect
  • Deduplicate listings and track salary trends across sources

Verdict

fastCRW is the tool for building recruiting data pipelines from public job boards. While LinkedIn is off-limits by ToS, legal public sources like Indeed, Glassdoor, and ZipRecruiter expose rich structured data that fastCRW extracts into hiring-ready JSON. You get fresh listings daily, structured fields out of the box, and a flexible pipeline that costs far less than per-record job APIs. The tradeoff: you must respect ToS and rate limits, and parsing free data is no substitute for official job board APIs where they exist.


Why Job Boards Need Web Scraping

Job APIs from Indeed, LinkedIn, and Glassdoor exist, but they're expensive, rate-limited, or restricted to premium partners. Scraping public job listings directly lets you:

  • build a single unified feed across multiple job boards,
  • refresh listings every few hours without per-request costs,
  • extract salary trends and market intelligence,
  • power a custom recruiting dashboard or job alert system,
  • and unlock wage transparency and equity analysis.

Most job boards publish listings in accessible HTML. fastCRW turns that HTML into clean, structured JSON—job title, company, salary, location, requirements—ready for hiring workflows.
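Concretely, one extracted listing might come back as a record like this (the values are illustrative; the field names mirror the extraction schema defined in the walkthrough below):

```python
# An illustrative extracted record — all values are made up
job = {
    "job_title": "Senior Software Engineer",
    "company_name": "Acme Corp",
    "location": "San Francisco, CA",
    "salary_min": 150000,
    "salary_max": 190000,
    "job_type": "Full-time",
    "description": "Build and operate distributed backend services...",
    "requirements": ["5+ years Python", "Distributed systems experience"],
}
```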


Where fastCRW Helps

  • Multi-board aggregation: crawl multiple job boards and merge results into a single feed
  • Salary extraction: LLM extraction parses "$80K–$120K" into structured min/max fields
  • Pagination handling: crawl with maxDepth: 2 follows pagination and loads all listings
  • Duplicate detection: extract by URL + company + title, then deduplicate in your pipeline
  • Recurring updates: schedule daily crawls to keep job data fresh

Architecture Overview

A typical job board scraping pipeline has four stages:

  1. Discovery: Map the job board to find listing URLs and pagination structure.
  2. Crawl: Fetch all listings across pages, respecting rate limits.
  3. Extraction: Parse structured fields (title, company, salary, location, description).
  4. Load: Deduplicate and insert into your recruiting database.

fastCRW handles discovery and crawl; your code handles extraction schema definition and deduplication.


Implementation Walkthrough

Here's a complete Python example that scrapes Indeed job listings for a role and location, extracts structured data, and deduplicates results.

Step 1: Install dependencies

uv venv
uv pip install requests python-dotenv

Step 2: Define your extraction schema

import json
import os
import time
from datetime import datetime

import requests
from dotenv import load_dotenv

# Load the API key from a .env file or the environment
load_dotenv()
FASTCRW_API_KEY = os.getenv("FASTCRW_API_KEY")
FASTCRW_BASE_URL = "https://api.fastcrw.com/v1"

# Define the extraction schema for job listings
JOB_EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "job_title": {
            "type": "string",
            "description": "The job title or position name"
        },
        "company_name": {
            "type": "string",
            "description": "The company or organization name"
        },
        "location": {
            "type": "string",
            "description": "Job location or remote status"
        },
        "salary_min": {
            "type": "number",
            "description": "Minimum salary in USD, or null if not specified"
        },
        "salary_max": {
            "type": "number",
            "description": "Maximum salary in USD, or null if not specified"
        },
        "job_type": {
            "type": "string",
            "enum": ["Full-time", "Part-time", "Contract", "Temporary", "Unknown"],
            "description": "Employment type"
        },
        "description": {
            "type": "string",
            "description": "Job description summary (first 500 chars)"
        },
        "requirements": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Key requirements or qualifications"
        }
    },
    "required": ["job_title", "company_name", "location"]
}

def crawl_job_board(query: str, location: str, max_pages: int = 5) -> list[dict]:
    """
    Crawl Indeed job listings for a given role and location.
    
    Args:
        query: Job title or keywords (e.g., "software engineer")
        location: City or remote (e.g., "San Francisco" or "Remote")
        max_pages: Max pagination pages to crawl (default 5 = ~100 results)
    
    Returns:
        List of crawl URLs to process
    """
    # Build the Indeed search URL with URL-encoded query parameters
    from urllib.parse import urlencode
    base_url = "https://www.indeed.com/jobs"
    search_url = f"{base_url}?{urlencode({'q': query, 'l': location})}"
    
    print(f"Mapping Indeed search: {search_url}")
    
    # Step 1: Map the job board to find listing pages
    map_payload = {
        "url": search_url,
        "maxDepth": 2,  # Follow pagination
    }
    
    map_response = requests.post(
        f"{FASTCRW_BASE_URL}/map",
        headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
        json=map_payload
    )
    map_response.raise_for_status()
    urls_to_crawl = map_response.json().get("urls", [])
    
    print(f"Found {len(urls_to_crawl)} listing pages")
    return urls_to_crawl[:max_pages]

def extract_jobs(url: str) -> list[dict]:
    """
    Crawl a job listing page and extract structured job data.
    
    Args:
        url: URL of the job listing page
    
    Returns:
        List of extracted job records
    """
    print(f"Crawling {url}")
    
    # Step 2: Crawl the listing page
    crawl_payload = {
        "url": url,
        "formats": ["markdown"],
        "extraction": {
            "schema": JOB_EXTRACTION_SCHEMA
        }
    }
    
    response = requests.post(
        f"{FASTCRW_BASE_URL}/crawl",
        headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
        json=crawl_payload,
        timeout=60
    )
    response.raise_for_status()
    
    crawl_result = response.json()
    
    # Extract jobs from the crawl result
    jobs = []
    if "data" in crawl_result:
        for item in crawl_result["data"]:
            if "extractedData" in item:
                job = item["extractedData"]
                job["source_url"] = item.get("url")
                job["scraped_at"] = datetime.utcnow().isoformat()
                jobs.append(job)
    
    return jobs

def deduplicate_jobs(jobs: list[dict]) -> list[dict]:
    """
    Deduplicate job listings by title, company, and location.
    
    Args:
        jobs: List of job records
    
    Returns:
        Deduplicated list (keeps first occurrence)
    """
    seen = set()
    unique_jobs = []
    
    for job in jobs:
        # Create a dedup key from title, company, location
        key = (
            job.get("job_title", "").lower().strip(),
            job.get("company_name", "").lower().strip(),
            job.get("location", "").lower().strip()
        )
        
        if key not in seen:
            seen.add(key)
            unique_jobs.append(job)
    
    return unique_jobs

def format_for_database(jobs: list[dict]) -> str:
    """
    Format extracted jobs as JSONL (one JSON object per line).
    Ready to load into a database or data warehouse.
    
    Args:
        jobs: List of job records
    
    Returns:
        JSONL string
    """
    return "\n".join(json.dumps(job) for job in jobs)

# Example: Main scraping pipeline
if __name__ == "__main__":
    print("Job Board Scraping Pipeline")
    print("-" * 50)
    
    # Step 1: Crawl the job board
    search_urls = crawl_job_board(
        query="software engineer",
        location="San Francisco, CA",
        max_pages=3
    )
    
    # Step 2: Extract jobs from each page
    all_jobs = []
    for url in search_urls:
        try:
            jobs = extract_jobs(url)
            all_jobs.extend(jobs)
            time.sleep(1)  # Respect rate limits
        except Exception as e:
            print(f"Error crawling {url}: {e}")
    
    print(f"\nExtracted {len(all_jobs)} job listings")
    
    # Step 3: Deduplicate
    unique_jobs = deduplicate_jobs(all_jobs)
    print(f"After dedup: {len(unique_jobs)} unique listings")
    
    # Step 4: Format for database
    jsonl_output = format_for_database(unique_jobs)
    
    # Save to file
    with open("jobs.jsonl", "w") as f:
        f.write(jsonl_output)
    
    print("Saved to jobs.jsonl")
    
    # Show first result
    if unique_jobs:
        print("\nExample listing:")
        print(json.dumps(unique_jobs[0], indent=2))

Step 3: Run the scraper

export FASTCRW_API_KEY="your_api_key_here"
uv run python scrape_jobs.py

This creates jobs.jsonl with deduplicated listings ready to load into your recruiting database.
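Because JSONL stores one record per line, loading it back (say, before inserting into a database) is a one-liner per record:

```python
import json

def load_jobs(path: str) -> list[dict]:
    """Read a JSONL file of job records back into a list of dicts,
    skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```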


Production Considerations

Rate Limiting and Politeness

Job boards can block aggressive scrapers. fastCRW includes built-in rate limiting, but respect these practices:

  1. Add delays between requests: Use time.sleep(1) or higher between page crawls.
  2. Set a descriptive User-Agent: Let the site know who you are.
  3. Crawl off-peak hours: Early morning or late evening puts less load on their servers.
  4. Rotate IP addresses: For large-scale scrapes, use a proxy service (fastCRW supports residential proxies via Business plan).
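The delay practice works better with random jitter, so requests don't land on a perfectly fixed cadence. A small sketch (tune the numbers to the target site):

```python
import random
import time

def polite_sleep(base: float = 1.0, jitter: float = 0.5) -> float:
    """Sleep for `base` seconds plus up to `jitter` seconds of random noise.
    Returns the delay actually used, which is handy for logging crawl pacing."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```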

Handling JavaScript-Rendered Listings

Some job boards (e.g., newer Glassdoor) load listings via JavaScript. fastCRW's LightPanda rendering (Pro plan) or Chrome rendering (Business plan) handles this automatically. Just request formats: ["markdown"] and let fastCRW render the page.

Deduplication Strategy

Job postings get reposted across multiple boards. Deduplicate by:

  1. Exact match: title + company + location
  2. Fuzzy match: if titles vary slightly, use a similarity ratio (Python's standard difflib) or Levenshtein distance (a third-party library)
  3. URL dedup: some boards repost the same job under different URLs

The simple exact-match approach works for most cases.
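When you do need the fuzzy case, the standard library's difflib gives a similarity ratio without extra dependencies. A sketch (the 0.85 threshold is an arbitrary starting point to tune):

```python
from difflib import SequenceMatcher

def titles_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two job titles as duplicates when their similarity ratio
    (Ratcliff/Obershelp, via difflib) meets the threshold."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold
```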

Updating Old Data

Run your scraper daily to keep job data fresh. Most job boards expire listings after 30–60 days. Track scraped_at timestamps and remove stale entries monthly.
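A monthly cleanup pass can lean on the scraped_at field written during extraction. A sketch, assuming ISO-8601 timestamps:

```python
from datetime import datetime, timedelta, timezone

def prune_stale(jobs: list[dict], max_age_days: int = 60) -> list[dict]:
    """Drop records whose scraped_at timestamp is older than max_age_days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    fresh = []
    for job in jobs:
        scraped = datetime.fromisoformat(job["scraped_at"])
        if scraped.tzinfo is None:  # naive timestamps (e.g. from utcnow())
            scraped = scraped.replace(tzinfo=timezone.utc)
        if scraped >= cutoff:
            fresh.append(job)
    return fresh
```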


Respect Terms of Service

  • LinkedIn: Explicitly forbids scraping job listings. Do not scrape.
  • Indeed: Verify the current ToS before scraping; rate limiting is actively enforced, so use reasonable delays.
  • Glassdoor: A gray area; check the current ToS. Use with caution.
  • ZipRecruiter: Public listings are generally accessible, but confirm the current ToS.
  • Custom careers pages: Usually allowed unless the site's robots.txt or ToS forbids it.

Always check a site's robots.txt and ToS before scraping.

Rate Limiting

Aggressive scraping can trigger IP bans. fastCRW manages request throttling, but pair it with:

  1. Identifying your crawler (User-Agent header)
  2. Respecting Retry-After headers if the site returns 429
  3. Crawling during off-peak hours
  4. Using residential proxies for large-scale projects
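Honoring Retry-After amounts to a few lines: use the header's value when it's numeric, otherwise fall back to exponential backoff. A sketch (Retry-After can also be an HTTP date, which this fallback deliberately ignores):

```python
def retry_after_seconds(headers: dict, attempt: int) -> float:
    """Seconds to wait before retrying a 429 response: honor a numeric
    Retry-After header, else use exponential backoff (1s, 2s, 4s, ...)."""
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return float(value)
        except ValueError:
            pass  # HTTP-date form; fall through to backoff
    return float(2 ** attempt)
```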

Wage Transparency and Ethics

Job salary data enables wage equity analysis and transparency. Use it responsibly:

  • Aggregate salary by role, location, and experience level.
  • Avoid reverse-identifying individuals from job posting metadata.
  • Focus on aggregate trends (median, percentiles) not individual records.
  • Disclose data sources when publishing salary surveys.
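Aggregating to medians and percentiles takes only the standard library's statistics module. A sketch over salary-range midpoints, assuming the salary_min/salary_max fields from the extraction schema:

```python
from statistics import median, quantiles

def salary_summary(jobs: list[dict]) -> dict:
    """Aggregate salary-range midpoints into count, median, and quartiles.
    Records without both salary bounds are skipped."""
    midpoints = [
        (job["salary_min"] + job["salary_max"]) / 2
        for job in jobs
        if job.get("salary_min") is not None and job.get("salary_max") is not None
    ]
    if len(midpoints) < 2:
        return {}  # too few data points to aggregate meaningfully
    q1, _, q3 = quantiles(midpoints, n=4)
    return {
        "count": len(midpoints),
        "median": median(midpoints),
        "p25": q1,
        "p75": q3,
    }
```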

FAQ

Q: How is fastCRW different from a job API like LinkedIn Recruiter or Indeed API?

A: APIs are official but expensive (per-record or per-month licensing) and often restricted to premium partners. Scraping public listings is free but requires respecting ToS and rate limits. fastCRW is ideal when you want:

  • Low cost for high volume
  • Data from multiple boards unified
  • Flexibility to run on your own schedule
  • Custom extraction fields beyond the API schema

Q: Can I commercialize job data I scrape?

A: Generally no. Most job boards' ToS forbid commercial reuse without licensing. You can scrape for internal recruiting use, but selling a "salary database" built from scraped Indeed listings violates their ToS. Check with a lawyer if you're unsure.

Q: How do I handle listings that require login?

A: Most public job boards don't require login to view listings. If one does, scraping the login-protected content likely violates ToS. Stick to public boards and use official APIs for protected data.

Q: What's the difference between scraping job boards and scraping careers pages?

A: Public job boards (Indeed, Glassdoor) have many listings and are crawled by bots routinely. Company careers pages are less monitored but less data-rich. Both are valid targets if ToS permits. fastCRW can handle both with the same extraction schema.

Q: How do I detect and remove old postings?

A: Track the scraped_at timestamp and URL. If the same URL stops appearing in your crawls, the listing has likely expired; after two or more weeks of absence, remove it, or confirm the URL returns a 404 before deleting.

Q: Can fastCRW handle Indeed's dynamic pagination?

A: Yes. Use crawl with maxDepth: 2 and request markdown output. fastCRW's rendering will load pagination and extract listings from all visible pages.
