Web Scraping for Job Board Data
Use fastCRW to scrape job listings from public boards and build recruiting pipelines with structured data extraction.
fastCRW excels at turning job listings into structured hiring data. Public boards like Indeed and Glassdoor expose their content in HTML; fastCRW extracts titles, companies, locations, salaries, and descriptions at scale. LinkedIn forbids scraping by ToS—respect that boundary. For legal public boards, you get a recruiting pipeline that updates continuously without the per-record cost of traditional job APIs.
Verdict
fastCRW is the tool for building recruiting data pipelines from public job boards. While LinkedIn is off-limits by ToS, legal public sources like Indeed, Glassdoor, and ZipRecruiter expose rich structured data that fastCRW extracts into hiring-ready JSON. You get fresh listings daily, structured fields out of the box, and a flexible pipeline that costs far less than per-record job APIs. The tradeoff: you must respect ToS and rate limits, and parsing free data is no substitute for official job board APIs where they exist.
Why Job Boards Need Web Scraping
Job APIs from Indeed, LinkedIn, and Glassdoor exist, but they're expensive, rate-limited, or restricted to premium partners. Scraping public job listings directly lets you:
- build a single unified feed across multiple job boards,
- refresh listings every few hours without per-request costs,
- extract salary trends and market intelligence,
- power a custom recruiting dashboard or job alert system,
- and unlock wage transparency and equity analysis.
Most job boards publish listings in accessible HTML. fastCRW turns that HTML into clean, structured JSON—job title, company, salary, location, requirements—ready for hiring workflows.
Where fastCRW Helps
| Need | fastCRW Role |
|---|---|
| Multi-board aggregation | crawl multiple job boards and merge results into a single feed |
| Salary extraction | LLM extraction parses "$80K–$120K" into structured min/max fields |
| Pagination handling | crawl with maxDepth: 2 follows pagination and loads all listings |
| Duplicate detection | Extract by URL + company + title, then deduplicate in your pipeline |
| Recurring updates | Schedule daily crawls to keep job data fresh |
Architecture Overview
A typical job board scraping pipeline has four stages:
- Discovery: Map the job board to find listing URLs and pagination structure.
- Crawl: Fetch all listings across pages, respecting rate limits.
- Extraction: Parse structured fields (title, company, salary, location, description).
- Load: Deduplicate and insert into your recruiting database.
fastCRW handles discovery and crawl; your code handles extraction schema definition and deduplication.
Implementation Walkthrough
Here's a complete Python example that scrapes Indeed job listings for a role and location, extracts structured data, and deduplicates results.
Step 1: Install dependencies
uv venv
uv pip install requests python-dotenv
Step 2: Define your extraction schema
import json
import os
import time
from datetime import datetime

import requests
from dotenv import load_dotenv

# Load API key from the environment (or a .env file via python-dotenv)
load_dotenv()
FASTCRW_API_KEY = os.getenv("FASTCRW_API_KEY")
FASTCRW_BASE_URL = "https://api.fastcrw.com/v1"
# Define the extraction schema for job listings
JOB_EXTRACTION_SCHEMA = {
"type": "object",
"properties": {
"job_title": {
"type": "string",
"description": "The job title or position name"
},
"company_name": {
"type": "string",
"description": "The company or organization name"
},
"location": {
"type": "string",
"description": "Job location or remote status"
},
"salary_min": {
"type": "number",
"description": "Minimum salary in USD, or null if not specified"
},
"salary_max": {
"type": "number",
"description": "Maximum salary in USD, or null if not specified"
},
"job_type": {
"type": "string",
"enum": ["Full-time", "Part-time", "Contract", "Temporary", "Unknown"],
"description": "Employment type"
},
"description": {
"type": "string",
"description": "Job description summary (first 500 chars)"
},
"requirements": {
"type": "array",
"items": {"type": "string"},
"description": "Key requirements or qualifications"
}
},
"required": ["job_title", "company_name", "location"]
}
def crawl_job_board(query: str, location: str, max_pages: int = 5) -> list[str]:
"""
Crawl Indeed job listings for a given role and location.
Args:
query: Job title or keywords (e.g., "software engineer")
location: City or remote (e.g., "San Francisco" or "Remote")
max_pages: Max pagination pages to crawl (default 5 = ~100 results)
Returns:
List of crawl URLs to process
"""
    # Build the Indeed search URL (URL-encode the query parameters)
    from urllib.parse import urlencode

    base_url = "https://www.indeed.com/jobs"
    search_url = f"{base_url}?{urlencode({'q': query, 'l': location})}"
print(f"Mapping Indeed search: {search_url}")
# Step 1: Map the job board to find listing pages
map_payload = {
"url": search_url,
"maxDepth": 2, # Follow pagination
}
    map_response = requests.post(
        f"{FASTCRW_BASE_URL}/map",
        headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
        json=map_payload,
        timeout=60
    )
map_response.raise_for_status()
urls_to_crawl = map_response.json().get("urls", [])
print(f"Found {len(urls_to_crawl)} listing pages")
return urls_to_crawl[:max_pages]
def extract_jobs(url: str) -> list[dict]:
"""
Crawl a job listing page and extract structured job data.
Args:
url: URL of the job listing page
Returns:
List of extracted job records
"""
print(f"Crawling {url}")
# Step 2: Crawl the listing page
crawl_payload = {
"url": url,
"formats": ["markdown"],
"extraction": {
"schema": JOB_EXTRACTION_SCHEMA
}
}
response = requests.post(
f"{FASTCRW_BASE_URL}/crawl",
headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
json=crawl_payload,
timeout=60
)
response.raise_for_status()
crawl_result = response.json()
# Extract jobs from the crawl result
jobs = []
if "data" in crawl_result:
for item in crawl_result["data"]:
if "extractedData" in item:
job = item["extractedData"]
job["source_url"] = item.get("url")
job["scraped_at"] = datetime.utcnow().isoformat()
jobs.append(job)
return jobs
def deduplicate_jobs(jobs: list[dict]) -> list[dict]:
"""
Deduplicate job listings by title, company, and location.
Args:
jobs: List of job records
Returns:
Deduplicated list (keeps first occurrence)
"""
seen = set()
unique_jobs = []
for job in jobs:
# Create a dedup key from title, company, location
key = (
job.get("job_title", "").lower().strip(),
job.get("company_name", "").lower().strip(),
job.get("location", "").lower().strip()
)
if key not in seen:
seen.add(key)
unique_jobs.append(job)
return unique_jobs
def format_for_database(jobs: list[dict]) -> str:
"""
Format extracted jobs as JSONL (one JSON object per line).
Ready to load into a database or data warehouse.
Args:
jobs: List of job records
Returns:
JSONL string
"""
return "\n".join(json.dumps(job) for job in jobs)
# Example: Main scraping pipeline
if __name__ == "__main__":
print("Job Board Scraping Pipeline")
print("-" * 50)
# Step 1: Crawl the job board
search_urls = crawl_job_board(
query="software engineer",
location="San Francisco, CA",
max_pages=3
)
# Step 2: Extract jobs from each page
all_jobs = []
for url in search_urls:
try:
jobs = extract_jobs(url)
all_jobs.extend(jobs)
time.sleep(1) # Respect rate limits
except Exception as e:
print(f"Error crawling {url}: {e}")
print(f"\nExtracted {len(all_jobs)} job listings")
# Step 3: Deduplicate
unique_jobs = deduplicate_jobs(all_jobs)
print(f"After dedup: {len(unique_jobs)} unique listings")
# Step 4: Format for database
jsonl_output = format_for_database(unique_jobs)
# Save to file
with open("jobs.jsonl", "w") as f:
f.write(jsonl_output)
    print("Saved to jobs.jsonl")
# Show first result
if unique_jobs:
print("\nExample listing:")
print(json.dumps(unique_jobs[0], indent=2))
Step 3: Run the scraper
export FASTCRW_API_KEY="your_api_key_here"
uv run python scrape_jobs.py
This creates jobs.jsonl with deduplicated listings ready to load into your recruiting database.
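The JSONL output loads cleanly into any store. As a minimal sketch, here is a loader for a local SQLite table (the load_jobs function and the jobs table are illustrative, cover only a subset of the schema's fields, and are not part of fastCRW):

```python
import json
import sqlite3

def load_jobs(jsonl_path: str, db_path: str = "jobs.db") -> int:
    """Load the pipeline's JSONL output into a SQLite table and return
    the resulting row count. A minimal sketch; swap in your warehouse
    loader in production."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            job_title TEXT, company_name TEXT, location TEXT,
            salary_min REAL, salary_max REAL, job_type TEXT,
            source_url TEXT, scraped_at TEXT,
            UNIQUE (job_title, company_name, location)
        )
    """)
    with open(jsonl_path) as f:
        for line in f:
            job = json.loads(line)
            conn.execute(
                # INSERT OR IGNORE re-applies the dedup key at load time
                "INSERT OR IGNORE INTO jobs VALUES (?,?,?,?,?,?,?,?)",
                (job.get("job_title"), job.get("company_name"),
                 job.get("location"), job.get("salary_min"),
                 job.get("salary_max"), job.get("job_type"),
                 job.get("source_url"), job.get("scraped_at")),
            )
    conn.commit()
    count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    conn.close()
    return count
```

The UNIQUE constraint mirrors the dedup key used earlier, so reloading the same file is idempotent.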
Production Considerations
Rate Limiting and Politeness
Job boards can block aggressive scrapers. fastCRW includes built-in rate limiting, but respect these practices:
- Add delays between requests: Use time.sleep(1) or higher between page crawls.
- Set a descriptive User-Agent: Let the site know who you are.
- Crawl off-peak hours: Early morning or late evening puts less load on their servers.
- Rotate IP addresses: For large-scale scrapes, use a proxy service (fastCRW supports residential proxies via the Business plan).
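The delay and User-Agent practices above can be sketched as a small helper (the Throttle class, bot name, and contact address are illustrative; use your own):

```python
import time

# Descriptive User-Agent so the site can identify (and contact) your
# crawler. The bot name and address here are placeholders.
HEADERS = {"User-Agent": "example-recruiting-bot/1.0 (contact: data@example.com)"}

class Throttle:
    """Enforce a minimum interval between outgoing requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep calls min_interval apart
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call throttle.wait() before each request and merge HEADERS into the request headers alongside the Authorization header.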
Handling JavaScript-Rendered Listings
Some job boards (e.g., newer Glassdoor) load listings via JavaScript. fastCRW's LightPanda rendering (Pro plan) or Chrome rendering (Business plan) handles this automatically. Just request formats: ["markdown"] and let fastCRW render the page.
Deduplication Strategy
Job postings get reposted across multiple boards. Deduplicate by:
- Exact match: title + company + location
- Fuzzy match: If titles vary slightly, use a string-similarity ratio (Python's built-in difflib) or Levenshtein distance via a third-party library such as rapidfuzz
- URL dedup: Some boards repost the same job with different URLs
The simple exact-match approach works for most cases.
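The fuzzy-match option can be sketched with the standard library's difflib (similar_titles and the 0.85 threshold are illustrative; tune the threshold on your own data):

```python
from difflib import SequenceMatcher

def similar_titles(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy title match using difflib's similarity ratio (0.0-1.0).
    Normalizes case and whitespace before comparing."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold
```

Run this only within groups that already share a company and location, so the pairwise comparison stays cheap.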
Updating Old Data
Run your scraper daily to keep job data fresh. Most job boards expire listings after 30–60 days. Track scraped_at timestamps and remove stale entries monthly.
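The monthly cleanup can be sketched as (prune_stale is a hypothetical helper; it assumes the ISO-format scraped_at timestamps written by the pipeline above):

```python
from datetime import datetime, timedelta, timezone

def prune_stale(jobs: list[dict], max_age_days: int = 60) -> list[dict]:
    """Drop records whose scraped_at timestamp is older than
    max_age_days. Assumes ISO-format timestamps."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    fresh = []
    for job in jobs:
        ts = datetime.fromisoformat(job["scraped_at"])
        if ts.tzinfo is None:
            # Naive timestamps from the pipeline are treated as UTC
            ts = ts.replace(tzinfo=timezone.utc)
        if ts >= cutoff:
            fresh.append(job)
    return fresh
```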
Legal and Ethical Notes
Respect Terms of Service
- LinkedIn: Explicitly forbids scraping job listings. Do not scrape.
- Indeed: Check the current ToS before scraping; rate limiting is enforced, so use reasonable delays.
- Glassdoor: Scraping is a gray area; check the current ToS and proceed with caution.
- ZipRecruiter: Public listings are scrapable. Generally permissive.
- Custom careers pages: Usually allowed unless the site's robots.txt forbids it.
Always check a site's robots.txt and ToS before scraping.
Rate Limiting
Aggressive scraping can trigger IP bans. fastCRW manages request throttling, but pair it with:
- Identifying your crawler (User-Agent header)
- Respecting Retry-After headers when the site returns 429
- Crawling during off-peak hours
- Using residential proxies for large-scale projects
Wage Transparency and Ethics
Job salary data enables wage equity analysis and transparency. Use it responsibly:
- Aggregate salary by role, location, and experience level.
- Avoid reverse-identifying individuals from job posting metadata.
- Focus on aggregate trends (median, percentiles) not individual records.
- Disclose data sources when publishing salary surveys.
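The aggregate-only analysis above can be sketched with the standard library (salary_summary is a hypothetical helper; it reduces each listing to a salary midpoint before computing quartiles):

```python
import statistics

def salary_summary(jobs: list[dict]) -> dict:
    """Aggregate salary midpoints into median and quartiles,
    skipping listings without salary data."""
    midpoints = [
        (job["salary_min"] + job["salary_max"]) / 2
        for job in jobs
        if job.get("salary_min") is not None
        and job.get("salary_max") is not None
    ]
    if len(midpoints) < 4:
        return {"n": len(midpoints)}  # too few points for quartiles
    q1, q2, q3 = statistics.quantiles(midpoints, n=4)
    return {"n": len(midpoints), "p25": q1, "median": q2, "p75": q3}
```

Publishing only the count, median, and percentiles keeps individual postings out of the output.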
FAQ
Q: How is fastCRW different from a job API like LinkedIn Recruiter or Indeed API?
A: APIs are official but expensive (per-record or per-month licensing) and often restricted to premium partners. Scraping public listings is free but requires respecting ToS and rate limits. fastCRW is ideal when you want:
- Low cost for high volume
- Data from multiple boards unified
- Flexibility to run on your own schedule
- Custom extraction fields beyond the API schema
Q: Can I commercialize job data I scrape?
A: Generally no. Most job boards' ToS forbid commercial reuse without licensing. You can scrape for internal recruiting use, but selling a "salary database" built from scraped Indeed listings violates their ToS. Check with a lawyer if you're unsure.
Q: How do I handle listings that require login?
A: Most public job boards don't require login to view listings. If one does, scraping the login-protected content likely violates ToS. Stick to public boards and use official APIs for protected data.
Q: What's the difference between scraping job boards and scraping careers pages?
A: Public job boards (Indeed, Glassdoor) have many listings and are crawled by bots routinely. Company careers pages are less monitored but less data-rich. Both are valid targets if ToS permits. fastCRW can handle both with the same extraction schema.
Q: How do I detect and remove old postings?
A: Track scraped_at timestamp and URL. If the same URL stops appearing in your crawls, it's likely expired. After 2+ weeks of absence, remove it. Or check if the URL 404s before deleting.
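The absence-tracking approach can be sketched as (find_expired is a hypothetical helper; it assumes you persist a last-seen ISO timestamp per URL between crawls):

```python
from datetime import datetime, timedelta, timezone

def find_expired(known_urls: dict[str, str], current_urls: set[str],
                 absence_days: int = 14) -> list[str]:
    """known_urls maps URL -> ISO timestamp of when it last appeared in
    a crawl; current_urls is the set from today's crawl. Returns URLs
    absent for longer than absence_days (candidates for removal)."""
    now = datetime.now(timezone.utc)
    expired = []
    for url, last_seen in known_urls.items():
        if url in current_urls:
            continue  # still listed; nothing to do
        ts = datetime.fromisoformat(last_seen)
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)
        if now - ts > timedelta(days=absence_days):
            expired.append(url)
    return expired
```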
Q: Can fastCRW handle Indeed's dynamic pagination?
A: Yes. Use crawl with maxDepth: 2 and request markdown output. fastCRW's rendering will load pagination and extract listings from all visible pages.
Related resources
- Firecrawl alternatives — comparison if you're evaluating managed APIs for recruiting pipelines
- Scrapfly alternatives — proxy-rotating option for boards that block easily
- n8n integration — schedule daily listing scrapes with no code
- Make integration — same pattern, different automation surface
- Lead enrichment — pair job data with company/recruiter enrichment for sourcing
- Competitor monitoring — track which roles your competitors are hiring for