Scheduled Web Scraping in GitHub Actions With CRW (2026)

Run scrapes on a schedule for free with GitHub Actions: spin up CRW as a service container, scrape with Python, commit results, and open a PR on change. Full workflow YAML — no servers, AGPL-3.0.

fastcrw
June 6, 2026 · 13 min read

What We're Building

A zero-server scraping job that runs entirely inside GitHub Actions on a cron schedule. CRW runs as a service container next to the job, a Python step scrapes target pages, and the workflow commits the results back to the repo (or opens a PR when data changes). This is perfect for data snapshots, change detection, and small datasets — no VPS, no Lambda, no cost on public repos.

Why CRW Works Great in CI

  • 8 MB Docker image — pulls in seconds, so the runner is not bottlenecked on image fetch.
  • ~85 ms cold start — the service container is ready almost immediately.
  • Stateless — nothing to persist between runs; the workflow is the only state.

Step 1: Repo Layout

.
├── .github/workflows/scrape.yml
├── scrape.py
└── data/            # snapshots committed here

Step 2: The Scrape Script

scrape.py talks to CRW at http://localhost:3000. Because the job runs directly on the runner and the workflow maps port 3000, the service container is reachable on localhost from the job:

import hashlib, os, pathlib, sys
from firecrawl import FirecrawlApp

app = FirecrawlApp(
    api_key=os.environ.get("CRW_API_KEY", "fc-ci"),
    api_url="http://localhost:3000",
)

TARGETS = [
    "https://example.com/pricing",
    "https://example.com/changelog",
]

OUT = pathlib.Path("data")
OUT.mkdir(exist_ok=True)


def slug(url: str) -> str:
    return hashlib.sha1(url.encode()).hexdigest()[:12]


def main() -> int:
    changed = False
    for url in TARGETS:
        doc = app.scrape_url(url, params={"formats": ["markdown"],
                                          "onlyMainContent": True})
        md = (doc or {}).get("markdown", "")
        if not md:
            print(f"WARN empty: {url}", file=sys.stderr)
            continue
        path = OUT / f"{slug(url)}.md"
        old = path.read_text() if path.exists() else ""
        if old.strip() != md.strip():
            path.write_text(md)
            changed = True
            print(f"changed: {url}")
        else:
            print(f"unchanged: {url}")

    # Communicate change state to later workflow steps
    gh_out = os.environ.get("GITHUB_OUTPUT")
    if gh_out:
        with open(gh_out, "a") as f:
            f.write(f"changed={'true' if changed else 'false'}\n")
    return 0


if __name__ == "__main__":
    sys.exit(main())
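Because slug() hashes the URL, the file names in data/ do not say which page they hold. One way to keep them traceable is a small index file next to the snapshots. This is a minimal sketch, not part of the original script: build_index, write_index, and the index.json file name are all hypothetical.

```python
import hashlib
import json
import pathlib


def slug(url: str) -> str:
    # Same scheme as scrape.py: first 12 hex chars of the URL's SHA-1.
    return hashlib.sha1(url.encode()).hexdigest()[:12]


def build_index(urls: list[str]) -> dict[str, str]:
    # Map each snapshot file name back to the URL it came from.
    return {f"{slug(u)}.md": u for u in urls}


def write_index(urls: list[str], out_dir: pathlib.Path) -> pathlib.Path:
    # index.json is an assumed file name, not something CRW produces.
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "index.json"
    # sort_keys keeps the file byte-stable, so it does not cause spurious diffs.
    path.write_text(json.dumps(build_index(urls), indent=2, sort_keys=True) + "\n")
    return path
```

Calling write_index(TARGETS, OUT) at the end of main() would refresh the index on every run; since the output is sorted, it only changes when TARGETS does.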

Step 3: The Workflow

CRW runs as a services: container with a health check; the job waits until it is healthy before scraping:

name: scheduled-scrape

on:
  schedule:
    - cron: "0 6 * * *"   # daily 06:00 UTC
  workflow_dispatch: {}    # manual trigger button

permissions:
  contents: write
  pull-requests: write

jobs:
  scrape:
    runs-on: ubuntu-latest
    services:
      crw:
        image: ghcr.io/us/crw:latest   # pin to a version tag for reproducible runs
        ports:
          - 3000:3000
        options: >-
          --health-cmd "curl -f http://localhost:3000/health || exit 1"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 6
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

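      - name: Install deps
        # Pin firecrawl-py in real use: scrape.py relies on the
        # scrape_url(url, params=...) call signature.
        run: pip install firecrawl-py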

      - name: Run scrape
        id: scrape
        env:
          CRW_API_KEY: ${{ secrets.CRW_API_KEY }}
        run: python scrape.py

      - name: Open PR on change
        if: steps.scrape.outputs.changed == 'true'
        uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "data: refresh scraped snapshots"
          title: "Data update ${{ github.run_number }}"
          body: "Automated snapshot refresh from scheduled scrape."
          branch: data/auto-refresh
          add-paths: data/
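The health command in the services block runs inside the container, so it relies on the image shipping curl. If you would rather not depend on that, one option is to probe from the job side before scraping. The sketch below is a fallback under stated assumptions, not part of CRW: http_ok and wait_until are hypothetical helpers, the /health path is taken from the workflow's health command, and the retry defaults mirror its interval and retry count.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    # True when the endpoint answers with a 2xx status.
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until(probe: Callable[[], bool], attempts: int = 6,
               interval_s: float = 10.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    # Retry the probe a fixed number of times, pausing between failures.
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(interval_s)
    return False
```

In scrape.py this could run before the loop, for example: if not wait_until(lambda: http_ok("http://localhost:3000/health")): sys.exit(1). Passing the probe and sleep functions in keeps the retry logic testable without a network.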

Step 4: Add the Secret

If your CRW image enforces an API key, set it in repo settings:

Settings → Secrets and variables → Actions → New repository secret
Name:  CRW_API_KEY
Value: fc-your-long-random-key

For a private self-host you would point api_url at that host instead of the service container — the script changes only the one URL.
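To make that one-URL switch a configuration change instead of a code change, the script could read the endpoint from the environment. This is a minimal sketch: CRW_API_URL is a hypothetical variable name chosen for illustration, and resolve_api_url is not part of the original script.

```python
import os

# Default to the service container from the workflow.
DEFAULT_API_URL = "http://localhost:3000"


def resolve_api_url(env=None) -> str:
    # CRW_API_URL is an assumed variable name, not a CRW convention.
    env = os.environ if env is None else env
    url = env.get("CRW_API_URL", "").strip()
    # Fall back to the service container when the variable is unset or blank.
    return url.rstrip("/") if url else DEFAULT_API_URL
```

The client setup would then become FirecrawlApp(api_key=..., api_url=resolve_api_url()), and the workflow sets the variable only when it targets a different host.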

Step 5: Test Without Waiting for Cron

Use the manual trigger to validate the pipeline immediately:

gh workflow run scheduled-scrape
gh run watch

On the first run, every target is "changed" and a PR opens. On subsequent runs with no site changes, the job is a fast no-op and no PR appears.

When CI-Based Scraping Is the Right Tool

This pattern is deliberately constrained, and knowing its envelope keeps you out of trouble. It shines for low-frequency, low-volume jobs where the output is small enough to version: a daily pricing snapshot, a changelog watcher, a competitor's docs diff, a handful of pages whose history you want in git. The killer feature is that there is no infrastructure — no VPS to patch, no Lambda to deploy, no database to back up. The workflow file is the entire system, and git is the audit log.

It is the wrong tool when you need minute-level freshness (GitHub's scheduler can delay scheduled runs under load and offers no SLA on start time), when you scrape thousands of pages (you will blow CI minutes and runtime limits), or when the data is large or sensitive (committing it to a repo is the wrong store). For those, graduate to the always-on scheduled-crawl pattern on a server. A good rule of thumb: if a run takes under ~10 minutes, produces under a few megabytes, and tolerates being an hour late occasionally, CI is the cheapest correct answer. Otherwise, move it.
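The size half of that rule of thumb is easy to check from inside the job. The sketch below measures the snapshot directory against a budget; the 5 MB default is an assumed reading of "a few megabytes", and both helper names are hypothetical.

```python
import pathlib


def dir_size_bytes(root: pathlib.Path) -> int:
    # Sum the sizes of all regular files under root.
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def within_budget(root: pathlib.Path, max_mb: float = 5.0) -> bool:
    # max_mb is an assumed threshold; tune it to your repo.
    return dir_size_bytes(root) <= max_mb * 1024 * 1024
```

A run could print a warning, or fail outright, when within_budget(pathlib.Path("data")) turns false, which is a useful early signal that the job is outgrowing CI.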

Making the Job Deterministic and Debuggable

CI scrapers fail in ways that are annoying to debug because you cannot attach a shell to a finished runner. Build in observability up front. Always upload the scraped output as an artifact even on failure, so a broken run leaves evidence:

      - name: Upload snapshots
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: snapshots-${{ github.run_number }}
          path: data/
          retention-days: 7

Pin the CRW image to a version tag rather than :latest in the services: block so a run is reproducible: a green run last week and a red run today should never be explained by "the image changed underneath us." Also exit non-zero from the script on a hard failure so the workflow goes red instead of silently committing nothing. As written, the example's scrape.py always returns 0; it could instead return 1 when every target comes back empty. A scraper that fails silently is worse than no scraper, because you stop looking at it.
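One way to express that failure rule is a small function that turns per-target results into an exit code. This is a sketch, assuming the script collects a mapping from URL to returned markdown; exit_code is a hypothetical helper, and treating an empty target list as a failure is a design choice, not something the original specifies.

```python
import sys


def exit_code(results: dict[str, str]) -> int:
    # results maps each target URL to the markdown it returned ("" if empty).
    if not results:
        return 1  # nothing was attempted; treat as a failure (assumption)
    empty = [url for url, md in results.items() if not md.strip()]
    for url in empty:
        print(f"WARN empty: {url}", file=sys.stderr)
    # Fail only when every target came back empty; partial data still counts.
    return 1 if len(empty) == len(results) else 0
```

main() would end with return exit_code(results) after writing GITHUB_OUTPUT, so a total outage turns the run red while a single flaky page does not.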

Cost and Concurrency Reality

On public repositories, Actions minutes are free, which makes this pattern genuinely zero-cost for open data work. On private repositories you consume your plan's minutes, so be deliberate: a job that pulls an 8 MB image, installs one dependency, and scrapes a few pages runs in a couple of minutes, but a matrix that fans out across many source groups multiplies that. If you must scrape many sources, prefer one job that iterates internally (with a polite delay between requests) over a large matrix of parallel jobs — it is cheaper in minutes and far gentler on the sites you are scraping, which is also the responsible choice.
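The "one job that iterates internally" approach can be sketched as a plain loop with a pause between requests. The names here are assumptions for illustration: scrape_sequentially is hypothetical, fetch stands in for whatever call returns a page's markdown, and the 2 second default delay is arbitrary.

```python
import time
from typing import Callable, Iterable


def scrape_sequentially(urls: Iterable[str], fetch: Callable[[str], str],
                        delay_s: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> dict[str, str]:
    results: dict[str, str] = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_s)  # pause between requests, not before the first
        try:
            results[url] = fetch(url)
        except Exception as exc:  # one bad page should not abort the run
            print(f"WARN failed: {url}: {exc}")
            results[url] = ""
    return results
```

A failed page is recorded as an empty string, so the caller can still tell "every target failed" apart from "one target failed" when deciding the exit code.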

Patterns and Gotchas

  • Commit vs PR — committing straight to main is simpler; a PR gives you a human review gate for data diffs. The example uses a PR.
  • Cron drift — GitHub schedules can be delayed under load. Do not assume exact-minute execution; design idempotent jobs.
  • Rate-limit yourself — for many targets, add a small sleep between requests so you stay a polite client.
  • Artifacts for large data — if snapshots get big, upload them as workflow artifacts instead of committing.
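On the idempotency point: a delayed or repeated run is harmless as long as writing the same content twice is a no-op. The sketch below isolates that rule using the same whitespace-insensitive comparison as scrape.py; write_if_changed is a hypothetical helper name.

```python
import pathlib


def write_if_changed(path: pathlib.Path, content: str) -> bool:
    # Return True only when the file on disk was actually updated.
    old = path.read_text(encoding="utf-8") if path.exists() else ""
    if old.strip() == content.strip():
        return False  # same content: leave the file and git state untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True
```

Running the job twice against an unchanged site then produces no diff, so cron drift or a manual re-run cannot open a duplicate PR.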

Why CRW for CI Scraping

  • Tiny image — 8 MB pulls fast, keeping CI minutes low.
  • Service-container friendly — stateless, instant health, no Redis or browser fleet to orchestrate.
  • No lock-in: the engine is open-core Rust, shipped as a small, low-latency, local-first single binary under AGPL-3.0, with a managed cloud option. Swap to fastCRW cloud by changing api_url if you outgrow the runner.

A Diff-Aware Variant That Reports What Changed

Committing snapshots tells you that something changed via the git diff, but a richer workflow summarizes what changed directly in the PR body so a reviewer does not have to read raw markdown diffs. Extend the script to compute a per-target change summary:

import difflib
import pathlib


def summarize_change(old: str, new: str) -> str:
    if not old:
        return "new page captured"
    diff = list(difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm="", n=0))
    # The first two entries are the "---"/"+++" file headers. Skip them by
    # position, not by prefix, so content lines such as a markdown "---" rule
    # are still counted.
    body = diff[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    return f"+{added} / -{removed} lines"


# in main(), accumulate a markdown summary
summary_lines: list[str] = []
# ... when a page changes:
#   summary_lines.append(f"- {url}: {summarize_change(old, md)}")

# write it for the PR step to consume
if summary_lines:
    pathlib.Path("CHANGE_SUMMARY.md").write_text("\n".join(summary_lines))

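Feed CHANGE_SUMMARY.md into the body of the create-pull-request step, for example through that action's body-path input, which reads the PR body from a file. Because add-paths is limited to data/, the summary file itself is not committed. Now the PR says "pricing page: +4 / -2 lines, changelog: new page captured" at a glance, turning a noisy data commit into a reviewable signal. This is the difference between a snapshot dump nobody reads and a change feed a human can triage in seconds.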

Next Steps

Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.

FAQ

Can I really run scheduled scraping for free?
Yes, for public repositories: GitHub Actions minutes are free there. CRW runs as a service container in the job, the engine is AGPL-3.0 with zero per-request cost, and you only commit small data snapshots back. Private repos consume Actions minutes per their plan.

Why use a service container instead of installing CRW in the job?
A service container starts CRW alongside the job with a health check, so the API is reachable on localhost:3000 as soon as it is healthy. The 8 MB image pulls in seconds and there is nothing to build or install in the job itself.
