Build a Jobs Aggregator in Python with CRW (2026): Crawl, Extract, Filter

Aggregate job postings across multiple career pages: crawl with CRW, extract structured roles via JSON schema, normalize, dedupe, and filter by keyword and location. Full Python code, self-hostable under AGPL-3.0.

fastcrw
By Recep · June 19, 2026 · 15 min read

What We're Building

A jobs aggregator that crawls a list of company career pages, extracts each posting as a structured record, normalizes the messy fields (locations, remote flags, seniority), dedupes across sources, and lets you filter by keyword and location. Job boards rarely expose APIs and every careers page is built differently — CRW's crawl + JSON-schema extraction handles both problems.

Architecture

  • Crawl — CRW walks each careers site and returns posting pages
  • Extract — JSON schema turns each page into a typed Job
  • Normalize — clean locations, infer remote, bucket seniority
  • Store + Filter — SQLite with dedupe and query helpers

Prerequisites

  • CRW running: docker run -p 3000:3000 ghcr.io/us/crw:latest
  • Python 3.10+ and an OpenAI API key (for extraction)
  • The firecrawl-py SDK, which Step 1 points at CRW: pip install firecrawl-py

Step 1: SDK Setup

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR-KEY", api_url="http://localhost:3000")
# fastCRW cloud: api_url="https://api.fastcrw.com"

Step 2: Job Schema

JOB_SCHEMA = {
    "type": "object",
    "properties": {
        "jobs": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "location": {"type": "string",
                                 "description": "Raw location text as shown"},
                    "remote": {"type": "boolean",
                               "description": "True if the role is remote-eligible"},
                    "department": {"type": "string"},
                    "url": {"type": "string", "description": "Link to the posting"},
                    "employment_type": {"type": "string",
                                        "description": "e.g. Full-time, Contract"},
                },
                "required": ["title", "url"],
            },
        }
    },
    "required": ["jobs"],
}
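
A schema's "required" list is a request to the model, not a guarantee, so it can help to check records before they reach the rest of the pipeline. The sketch below is a minimal, dependency-free check. The helper name split_valid and the sample payload are illustrative assumptions, not part of CRW or firecrawl-py.

```python
# Hypothetical helper: not part of CRW or firecrawl-py.
REQUIRED_JOB_FIELDS = ("title", "url")  # mirrors "required" in JOB_SCHEMA


def split_valid(payload: dict) -> tuple[list[dict], list[dict]]:
    """Split payload["jobs"] into (complete, incomplete) records."""
    complete, incomplete = [], []
    for job in payload.get("jobs") or []:
        has_all = isinstance(job, dict) and all(
            isinstance(job.get(f), str) and job[f].strip()
            for f in REQUIRED_JOB_FIELDS
        )
        (complete if has_all else incomplete).append(job)
    return complete, incomplete


# Made-up payload shaped like the schema above.
sample = {"jobs": [
    {"title": "Senior Backend Engineer", "location": "Remote - US",
     "remote": True, "url": "/careers/backend-123"},
    {"title": "Designer"},  # no url: fails the "required" list
]}
ok, bad = split_valid(sample)
print(f"{len(ok)} complete, {len(bad)} incomplete")
```

Logging the incomplete records rather than dropping them silently makes it easier to see when a prompt or schema tweak is needed.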

Step 3: Crawl + Extract a Careers Site

Crawl the careers section, then run extraction over the listing pages:

from urllib.parse import urljoin


def scrape_company(careers_url: str, company: str) -> list[dict]:
    crawl = app.crawl_url(careers_url, params={
        "limit": 40, "maxDepth": 2,
        "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
    })

    jobs: list[dict] = []
    for page in crawl.get("data", []):
        page_url = page.get("metadata", {}).get("sourceURL", careers_url)
        res = app.extract(urls=[page_url], params={
            "prompt": "Extract every job posting listed on this page. Ignore navigation and unrelated links.",
            "schema": JOB_SCHEMA,
        })
        if not res or "data" not in res:
            continue
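        for j in res["data"].get("jobs", []):
            # The model can omit fields despite "required"; skip incomplete records
            # so the urljoin below and the later INSERT never hit a missing key.
            if not j.get("title") or not j.get("url"):
                continue
            j["url"] = urljoin(page_url, j["url"])  # resolve relative links
            j["company"] = company
            jobs.append(j)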
    return jobs

Step 4: Normalize Messy Fields

Raw location strings are inconsistent ("Remote - US", "SF / NYC", "Anywhere"). Normalize so filtering works:

import re


def normalize(job: dict) -> dict:
    loc = (job.get("location") or "").strip()
    low = loc.lower()

    is_remote = bool(job.get("remote")) or any(
        kw in low for kw in ("remote", "anywhere", "distributed"))

    # crude seniority bucket from the title
    title = (job.get("title") or "").lower()  # tolerate a null title
    if any(k in title for k in ("staff", "principal", "lead", "senior", "sr.")):
        level = "senior"
    elif any(k in title for k in ("intern", "junior", "jr.", "entry")):
        level = "junior"
    else:
        level = "mid"

    cities = [c.strip() for c in re.split(r"[/|,]", loc) if c.strip()]

    return {**job, "is_remote": is_remote, "level": level,
            "cities": cities or ["unspecified"]}

Step 5: Store With Dedupe

import sqlite3, hashlib
from datetime import datetime

DB = "jobs.db"


def init_db():
    with sqlite3.connect(DB) as c:
        c.execute("""CREATE TABLE IF NOT EXISTS jobs (
            id TEXT PRIMARY KEY, company TEXT, title TEXT, url TEXT,
            location TEXT, is_remote INTEGER, level TEXT, seen_at TEXT)""")


def job_id(j: dict) -> str:
    key = f"{j['company']}|{j['title']}|{j['url']}".lower()
    return hashlib.sha256(key.encode()).hexdigest()[:16]


def save(jobs: list[dict]) -> int:
    added = 0
    with sqlite3.connect(DB) as c:
        for j in jobs:
            cur = c.execute(
                """INSERT OR IGNORE INTO jobs
                   VALUES (?,?,?,?,?,?,?,?)""",
                (job_id(j), j["company"], j["title"], j["url"],
                 j.get("location", ""), int(j["is_remote"]),
                 j["level"], datetime.now().isoformat()),
            )
            added += cur.rowcount
    return added
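
Because the id is hashed from the URL, two links to the same posting that differ only in tracking parameters, a fragment, or a trailing slash would produce two rows. One possible mitigation is to canonicalize the URL before hashing. This is a sketch under assumptions: canonical_url is a hypothetical helper, and the list of parameters treated as tracking noise is a guess you should adjust per source.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query keys never identify a posting.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}


def canonical_url(url: str) -> str:
    """Return a normalized form of url suitable for hashing into a job id."""
    parts = urlsplit(url.strip())
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
        and k.lower() not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/") or "/"  # treat /jobs/42 and /jobs/42/ alike
    # Drop the fragment entirely and lowercase the scheme and host.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


a = canonical_url("https://Acme.example.com/jobs/42/?utm_source=x&id=7#apply")
b = canonical_url("https://acme.example.com/jobs/42?id=7")
print(a == b)
```

If you adopt something like this, apply it in job_id so ids stay stable, and keep the original URL in the row for display.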

Step 6: Filter and Query

def search_jobs(keyword: str = "", remote_only: bool = False,
                level: str | None = None) -> list[dict]:
    q = "SELECT company, title, url, location, is_remote, level FROM jobs WHERE 1=1"
    args: list = []
    if keyword:
        q += " AND lower(title) LIKE ?"
        args.append(f"%{keyword.lower()}%")
    if remote_only:
        q += " AND is_remote = 1"
    if level:
        q += " AND level = ?"
        args.append(level)
    q += " ORDER BY seen_at DESC LIMIT 100"

    with sqlite3.connect(DB) as c:
        c.row_factory = sqlite3.Row
        return [dict(r) for r in c.execute(q, args).fetchall()]
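
search_jobs filters on keyword, remote flag, and level. A location filter can follow the same pattern by matching against the stored raw location text. The sketch below is one way to do it, not the tutorial's own code: search_by_location and its parameters are hypothetical, and it takes an open connection so it can be tried against an in-memory database with the Step 5 table layout.

```python
import sqlite3

COLS = ("company", "title", "url", "location", "is_remote")


def search_by_location(conn: sqlite3.Connection, place: str = "",
                       include_remote: bool = False,
                       limit: int = 100) -> list[dict]:
    """Jobs whose raw location mentions place (case-insensitive for ASCII)."""
    q = f"SELECT {', '.join(COLS)} FROM jobs WHERE 1=1"
    args: list = []
    if place:
        # instr() avoids LIKE wildcard surprises when place contains % or _
        clause = "instr(lower(location), ?) > 0"
        if include_remote:
            clause = f"({clause} OR is_remote = 1)"
        q += " AND " + clause
        args.append(place.lower())
    q += " ORDER BY seen_at DESC LIMIT ?"
    args.append(limit)
    return [dict(zip(COLS, row)) for row in conn.execute(q, args)]


# Demo against an in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    id TEXT PRIMARY KEY, company TEXT, title TEXT, url TEXT,
    location TEXT, is_remote INTEGER, level TEXT, seen_at TEXT)""")
conn.executemany("INSERT INTO jobs VALUES (?,?,?,?,?,?,?,?)", [
    ("1", "Acme", "Backend Engineer", "u1", "SF / NYC", 0, "mid", "2026-06-01T08:00:00"),
    ("2", "Globex", "Data Analyst", "u2", "Remote - US", 1, "mid", "2026-06-02T08:00:00"),
    ("3", "Acme", "Designer", "u3", "Berlin", 0, "mid", "2026-06-03T08:00:00"),
])
nyc = search_by_location(conn, "NYC")
nyc_or_remote = search_by_location(conn, "NYC", include_remote=True)
print([j["title"] for j in nyc_or_remote])
```

Matching on the raw string keeps the filter honest about what the page actually said; a dedicated cities table would be the next step if substring matching proves too loose.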

Step 7: Run It

COMPANIES = {
    "Acme":  "https://acme.example.com/careers",
    "Globex": "https://globex.example.com/jobs",
}


def main():
    init_db()
    total_new = 0
    for company, url in COMPANIES.items():
        raw = scrape_company(url, company)
        normalized = [normalize(j) for j in raw]
        total_new += save(normalized)
        print(f"{company}: {len(normalized)} postings")

    print(f"\n{total_new} new jobs added")
    print("\nRemote senior engineering roles:")
    for j in search_jobs(keyword="engineer", remote_only=True, level="senior"):
        print(f"  [{j['company']}] {j['title']} — {j['url']}")


if __name__ == "__main__":
    main()
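
As written, an exception from one company's crawl stops the whole run. A small wrapper can isolate failures per source. This is a sketch with hypothetical names (run_all, fake_scrape); the scrape function is passed in so the loop can be exercised without a network.

```python
from typing import Callable


def run_all(sources: dict[str, str],
            scrape: Callable[[str, str], list[dict]]
            ) -> tuple[dict[str, list[dict]], dict[str, str]]:
    """Scrape every source; return (results, failures) keyed by company."""
    results: dict[str, list[dict]] = {}
    failures: dict[str, str] = {}
    for company, url in sources.items():
        try:
            results[company] = scrape(url, company)
        except Exception as exc:  # broad on purpose: one bad source must not stop the rest
            failures[company] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_scrape(url: str, company: str) -> list[dict]:
    """Stand-in for scrape_company, used only for this demo."""
    if company == "Globex":
        raise TimeoutError("crawl timed out")
    return [{"title": "Engineer", "url": url + "/1", "company": company}]


results, failures = run_all(
    {"Acme": "https://acme.example.com/careers",
     "Globex": "https://globex.example.com/jobs"},
    fake_scrape,
)
print(sorted(results), failures)
```

In main() you would pass scrape_company as the second argument and report the failures at the end of the run.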

Make It Recurring

# crontab -e — refresh twice a day
0 8,20 * * *  cd /opt/jobs && /usr/bin/python3 aggregator.py >> jobs.log 2>&1

Dedupe on stable keys means re-runs only add genuinely new postings, so the digest stays signal, not noise.

Why Job Data Is Uniquely Messy

Of all the aggregation problems, job data is among the worst for structure. The same role is "Senior Software Engineer", "Sr. SWE II", "Engineer III", and "Member of Technical Staff" across four companies. Locations are "Remote (US)", "SF or NYC", "Hybrid - 3 days", "Anywhere on Earth", and a literal map widget with no text. Some companies post to a hosted ATS (Greenhouse, Lever, Ashby), some hand-roll a React careers app, and some bury openings in a PDF. A selector-based aggregator needs a parser per ATS plus a fallback per custom site, and it silently rots as each one redesigns. The schema approach moves the burden of "find the postings on this arbitrary page" to the model, which is why scrape_company is the same function for every employer. The normalization layer then does the genuinely hard, generic work (collapsing title variants into seniority buckets and parsing freeform location strings) once, for all sources, instead of per site.
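
The two normalization jobs named above can be tightened a little beyond plain substring checks. The sketch below is an assumption-laden variant, not a replacement for Step 4: it uses word boundaries so "Internal Tools Engineer" is not bucketed as an intern, and it splits locations on "or" as well as on slashes, pipes, and commas. The names seniority and parse_location are hypothetical.

```python
import re

SENIOR = re.compile(r"\b(staff|principal|lead|senior|sr)\b", re.IGNORECASE)
JUNIOR = re.compile(r"\b(intern|junior|jr|entry[- ]level)\b", re.IGNORECASE)
# Lowercase " or " only, so the state code in "Portland, OR" is left alone.
SPLIT = re.compile(r"\s*[/|,]\s*|\s+or\s+")
REMOTE_WORDS = ("remote", "anywhere", "distributed")


def seniority(title: str) -> str:
    """Bucket a title as senior, junior, or mid using whole-word matches."""
    text = title or ""
    if SENIOR.search(text):
        return "senior"
    if JUNIOR.search(text):
        return "junior"
    return "mid"


def parse_location(raw: str) -> dict:
    """Derive flags and a list of places while keeping the raw string."""
    text = (raw or "").strip()
    low = text.lower()
    return {
        "raw": text,
        "is_remote": any(w in low for w in REMOTE_WORDS),
        "is_hybrid": "hybrid" in low,
        "places": [p.strip() for p in SPLIT.split(text) if p.strip()],
    }


for t in ("Sr. SWE II", "Internal Tools Engineer", "Software Engineering Intern"):
    print(t, "->", seniority(t))
for loc in ("SF or NYC", "Hybrid - 3 days", "Anywhere on Earth"):
    print(loc, "->", parse_location(loc))
```

Titles such as "Engineer III" still land in "mid" here; level numerals differ too much between companies to bucket safely without per-company knowledge.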

Designing the Normalizer to Be Honest

The temptation in a jobs aggregator is to over-normalize and lose information. If you map every title to a rigid taxonomy you will mis-bucket the genuinely novel roles and frustrate the user who searched for the exact thing you discarded. The design here is intentionally conservative: it adds derived fields (is_remote, level, cities) alongside the raw location and title rather than replacing them. Search and filtering use the derived fields for recall; display uses the originals for fidelity. This way a crude seniority heuristic that occasionally guesses "mid" wrong is a soft failure (the role still shows up in an unfiltered search) instead of a hard one (the role vanishes). When you extend the normalizer — adding salary parsing, visa-sponsorship detection, tech-stack tags — keep this rule: enrich, never destroy. The raw extracted record is your source of truth and should always survive into storage.
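
One way to make "enrich, never destroy" hard to violate is to store the extracted record verbatim next to the derived fields and refuse any derived field that would shadow a raw one. The sketch below assumes a raw_json text column that the Step 5 table does not have; to_storage_row is a hypothetical helper.

```python
import json


def to_storage_row(raw: dict, derived: dict) -> dict:
    """Combine derived fields with an untouched JSON copy of the raw record."""
    overlap = raw.keys() & derived.keys()
    if overlap:
        # A derived field must never overwrite what the page actually said.
        raise ValueError(f"derived fields shadow raw fields: {sorted(overlap)}")
    return {
        **derived,
        "raw_json": json.dumps(raw, sort_keys=True, ensure_ascii=False),
    }


raw = {"title": "Sr. SWE II", "location": "Remote (US)", "url": "/jobs/7"}
row = to_storage_row(raw, {"is_remote": True, "level": "senior"})
print(row["level"], json.loads(row["raw_json"])["title"])
```

With the raw record preserved, a later improvement to the normalizer can be replayed over old rows without re-scraping.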

Keeping a Recurring Aggregator Trustworthy

An aggregator earns trust by being complete and current, and loses it by going stale invisibly. Two failure modes matter. The first is a company removing a posting (filled or pulled) while your store still shows it. Solve this with a last-seen timestamp per job and a sweep that marks anything not seen in the last N runs as expired rather than deleting it, so history is preserved but stale roles drop out of the default view. The second is a source that silently breaks (a careers page moves, returns a login wall, or starts rendering empty). Solve this by alerting when a source that historically returns dozens of jobs suddenly returns zero, which is almost always a scrape problem, not a hiring freeze:

def health_check(company: str, found: int, typical: int):
    if typical >= 5 and found == 0:
        # historically had postings, now none -> likely a broken source
        print(f"ALERT: {company} returned 0 jobs (usually ~{typical}). "
              f"Check the careers URL.")

These two guards — expiry by last-seen and a zero-result alarm on previously-productive sources — are what separate an aggregator people rely on from one they quietly stop checking because it "felt out of date." Both are generic and source-agnostic, so they scale with your source list for free.
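
Both guards can be sketched in a few lines. Note that the Step 5 table records seen_at only on first insert, so last-seen tracking needs its own column or table. The code below makes several assumptions: a separate job_seen table, expiry by age rather than by run count (with two runs a day, two days is roughly four runs), and a typical count taken as the median of recent non-zero runs. All names are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from statistics import median


def init_seen(conn: sqlite3.Connection) -> None:
    conn.execute("""CREATE TABLE IF NOT EXISTS job_seen (
        id TEXT PRIMARY KEY, first_seen TEXT, last_seen TEXT,
        expired INTEGER DEFAULT 0)""")


def touch(conn: sqlite3.Connection, job_id: str, now: datetime) -> None:
    """Record that job_id was seen in this run; revive it if it had expired."""
    ts = now.isoformat()
    conn.execute("INSERT OR IGNORE INTO job_seen (id, first_seen, last_seen) "
                 "VALUES (?,?,?)", (job_id, ts, ts))
    conn.execute("UPDATE job_seen SET last_seen = ?, expired = 0 WHERE id = ?",
                 (ts, job_id))


def sweep(conn: sqlite3.Connection, now: datetime, max_age: timedelta) -> int:
    """Mark jobs not seen within max_age as expired; return how many changed."""
    cutoff = (now - max_age).isoformat()
    cur = conn.execute("UPDATE job_seen SET expired = 1 "
                       "WHERE last_seen < ? AND expired = 0", (cutoff,))
    return cur.rowcount


def typical_count(history: list[int], window: int = 10) -> int:
    """Median of recent non-zero run counts, usable as health_check's typical."""
    recent = [n for n in history[-window:] if n > 0]
    return int(median(recent)) if recent else 0


conn = sqlite3.connect(":memory:")
init_seen(conn)
t0 = datetime(2026, 6, 1, 8, 0, tzinfo=timezone.utc)
touch(conn, "a", t0)
touch(conn, "b", t0)
t1 = t0 + timedelta(days=3)
touch(conn, "a", t1)  # "b" was not seen again
n_expired = sweep(conn, t1, max_age=timedelta(days=2))
print(n_expired, typical_count([30, 28, 0, 35]))
```

Zero-result runs are left out of the median so that a broken source does not drag its own baseline down and silence the alarm.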

Why CRW

  • No board API needed — crawl + schema extraction works on any careers page, Greenhouse, Lever, or hand-rolled.
  • One schema, many sites — no per-company selectors to maintain.
  • No lock-in — open-core Rust, small single binary, lower-latency, local-first, AGPL-3.0 + Managed Cloud.

Next Steps

Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.

Frequently asked questions

How do I aggregate jobs from sites with no API?
Crawl the careers section with CRW and run JSON-schema extraction over the listing pages. CRW's LLM extraction reads each posting semantically, so it works on Greenhouse, Lever, or fully custom careers pages without per-site selectors or a board API.

How does dedupe avoid re-adding the same posting every run?
Each job gets a stable id hashed from company, title, and URL, used as the SQLite primary key with INSERT OR IGNORE. Re-running the aggregator only inserts genuinely new postings, so scheduled refreshes stay signal rather than noise.

Get Started

Try CRW Free

Self-host for free (AGPL) or use fastCRW cloud with 500 free credits — no credit card required.
