What We're Building
A zero-server scraping job that runs entirely inside GitHub Actions on a cron schedule. CRW runs as a service container next to the job, a Python step scrapes target pages, and the workflow commits the results back to the repo (or opens a PR when data changes). This is perfect for data snapshots, change detection, and small datasets — no VPS, no Lambda, no cost on public repos.
Why CRW Works Great in CI
- 8 MB Docker image — pulls in seconds, so the runner is not bottlenecked on image fetch.
- ~85 ms cold start — the service container is ready almost immediately.
- Stateless — nothing to persist between runs; the workflow is the only state.
Step 1: Repo Layout
.
├── .github/workflows/scrape.yml
├── scrape.py
└── data/ # snapshots committed here
Step 2: The Scrape Script
scrape.py talks to CRW at http://localhost:3000 — the service container is reachable on localhost from the job:
import json, os, sys, pathlib, hashlib
from firecrawl import FirecrawlApp

app = FirecrawlApp(
    api_key=os.environ.get("CRW_API_KEY", "fc-ci"),
    api_url="http://localhost:3000",
)

TARGETS = [
    "https://example.com/pricing",
    "https://example.com/changelog",
]

OUT = pathlib.Path("data")
OUT.mkdir(exist_ok=True)

def slug(url: str) -> str:
    return hashlib.sha1(url.encode()).hexdigest()[:12]

def main() -> int:
    changed = False
    for url in TARGETS:
        doc = app.scrape_url(url, params={"formats": ["markdown"],
                                          "onlyMainContent": True})
        md = (doc or {}).get("markdown", "")
        if not md:
            print(f"WARN empty: {url}", file=sys.stderr)
            continue
        path = OUT / f"{slug(url)}.md"
        old = path.read_text() if path.exists() else ""
        if old.strip() != md.strip():
            path.write_text(md)
            changed = True
            print(f"changed: {url}")
        else:
            print(f"unchanged: {url}")
    # Communicate change state to later workflow steps
    gh_out = os.environ.get("GITHUB_OUTPUT")
    if gh_out:
        with open(gh_out, "a") as f:
            f.write(f"changed={'true' if changed else 'false'}\n")
    return 0

if __name__ == "__main__":
    sys.exit(main())
Step 3: The Workflow
CRW runs as a services: container with a health check; the job waits until it is healthy before scraping:
name: scheduled-scrape
on:
  schedule:
    - cron: "0 6 * * *" # daily 06:00 UTC
  workflow_dispatch: {} # manual trigger button
permissions:
  contents: write
  pull-requests: write
jobs:
  scrape:
    runs-on: ubuntu-latest
    services:
      crw:
        image: ghcr.io/us/crw:latest
        ports:
          - 3000:3000
        options: >-
          --health-cmd "curl -f http://localhost:3000/health || exit 1"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 6
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install deps
        run: pip install firecrawl-py
      - name: Run scrape
        id: scrape
        env:
          CRW_API_KEY: ${{ secrets.CRW_API_KEY }}
        run: python scrape.py
      - name: Open PR on change
        if: steps.scrape.outputs.changed == 'true'
        uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "data: refresh scraped snapshots"
          title: "Data update ${{ github.run_number }}"
          body: "Automated snapshot refresh from scheduled scrape."
          branch: data/auto-refresh
          add-paths: data/
Step 4: Add the Secret
If your CRW image enforces an API key, set it in repo settings:
Settings → Secrets and variables → Actions → New repository secret
Name: CRW_API_KEY
Value: fc-your-long-random-key
For a private self-host you would point api_url at that host instead of the service container — the script changes only the one URL.
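One way to make that a configuration change rather than a code edit is to read the URL from the environment. The sketch below is an illustration only: the `CRW_API_URL` variable name and the `resolve_api_url` helper are assumptions, not part of CRW or the Firecrawl SDK.

```python
import os
from typing import Mapping

# Default used when the job talks to the service container.
DEFAULT_API_URL = "http://localhost:3000"


def resolve_api_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the CRW base URL, preferring the hypothetical CRW_API_URL variable."""
    url = env.get("CRW_API_URL", "").strip()
    # Fall back to the service container and drop any trailing slash.
    return (url or DEFAULT_API_URL).rstrip("/")


in_ci = resolve_api_url({})
self_hosted = resolve_api_url({"CRW_API_URL": "https://crw.internal.example/"})
print(in_ci)
print(self_hosted)
```

In scrape.py the result would be passed as `api_url=resolve_api_url()`, and the workflow step would set the variable alongside `CRW_API_KEY` only when a remote host is wanted.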
Step 5: Test Without Waiting for Cron
Use the manual trigger to validate the pipeline immediately:
gh workflow run scheduled-scrape
gh run watch
On the first run, every target is "changed" and a PR opens. On subsequent runs with no site changes, the job is a fast no-op and no PR appears.
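The no-op behavior comes from the compare-before-write step in scrape.py. The sketch below isolates that step in a hypothetical helper, `write_if_changed`, and exercises it against a temporary directory instead of a real scrape, so the first-run and repeat-run cases can be checked locally.

```python
import pathlib
import tempfile


def write_if_changed(path: pathlib.Path, new_text: str) -> bool:
    """Write new_text only when it differs from the stored snapshot.

    Leading and trailing whitespace is ignored, as in scrape.py.
    """
    old = path.read_text() if path.exists() else ""
    if old.strip() == new_text.strip():
        return False
    path.write_text(new_text)
    return True


with tempfile.TemporaryDirectory() as tmp:
    snapshot = pathlib.Path(tmp) / "pricing.md"
    first = write_if_changed(snapshot, "# Pricing\nPro: $20\n")     # no file yet
    second = write_if_changed(snapshot, "# Pricing\nPro: $20\n\n")  # whitespace only
    third = write_if_changed(snapshot, "# Pricing\nPro: $25\n")     # real change

print(first, second, third)  # True False True
```

Because an unchanged page never touches the file, the working tree stays clean and the `changed` output stays `false`, which is what keeps the PR step from firing.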
When CI-Based Scraping Is the Right Tool
This pattern is deliberately constrained, and knowing its envelope keeps you out of trouble. It shines for low-frequency, low-volume jobs where the output is small enough to version: a daily pricing snapshot, a changelog watcher, a competitor's docs diff, a handful of pages whose history you want in git. The killer feature is that there is no infrastructure — no VPS to patch, no Lambda to deploy, no database to back up. The workflow file is the entire system, and git is the audit log.
It is the wrong tool when you need minute-level freshness (GitHub's scheduler can delay scheduled runs under load and offers no SLA on start time), when you scrape thousands of pages (you will blow CI minutes and runtime limits), or when the data is large or sensitive (committing it to a repo is the wrong store). For those, graduate to the always-on scheduled-crawl pattern on a server. A good rule of thumb: if a run takes under ~10 minutes, produces under a few megabytes, and tolerates being an hour late occasionally, CI is the cheapest correct answer. Otherwise, move it.
Making the Job Deterministic and Debuggable
CI scrapers fail in ways that are annoying to debug because you cannot attach a shell to a finished runner. Build in observability up front. Always upload the scraped output as an artifact even on failure, so a broken run leaves evidence:
- name: Upload snapshots
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: snapshots-${{ github.run_number }}
    path: data/
    retention-days: 7
Pin the CRW image to a version tag rather than :latest in the services: block so a run is reproducible — a green run last week and a red run today should never be explained by "the image changed underneath us." And exit non-zero from the script on a hard failure (the example's scrape.py writes to GITHUB_OUTPUT and could sys.exit(1) when every target returns empty) so the workflow goes red instead of silently committing nothing. A scraper that fails silently is worse than no scraper, because you stop looking at it.
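The hard-failure rule can be sketched as a small function. This is one possible policy, not the article's script: the `exit_code` helper and the `results` mapping (URL to scraped markdown) are hypothetical names, and treating an empty target list as a failure is an added assumption.

```python
import sys


def exit_code(results: dict[str, str]) -> int:
    """Return 1 when no target produced content, else 0.

    `results` maps each target URL to the markdown scraped for it
    (an empty string when the scrape returned nothing).
    """
    empty = [url for url, md in results.items() if not md.strip()]
    for url in empty:
        print(f"WARN empty: {url}", file=sys.stderr)
    # Assumption: an empty target list also counts as a hard failure.
    return 1 if len(empty) == len(results) else 0


all_empty = exit_code({
    "https://example.com/pricing": "",
    "https://example.com/changelog": "  ",
})
partial = exit_code({
    "https://example.com/pricing": "# Pricing",
    "https://example.com/changelog": "",
})
print(all_empty, partial)  # 1 0
```

In scrape.py, `main()` would collect the markdown per URL and end with `return exit_code(results)`, so a run where CRW returned nothing at all turns the workflow red while a single flaky page only logs a warning.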
Cost and Concurrency Reality
On public repositories, Actions minutes are free, which makes this pattern genuinely zero-cost for open data work. On private repositories you consume your plan's minutes, so be deliberate: a job that pulls an 8 MB image, installs one dependency, and scrapes a few pages runs in a couple of minutes, but a matrix that fans out across many source groups multiplies that. If you must scrape many sources, prefer one job that iterates internally (with a polite delay between requests) over a large matrix of parallel jobs — it is cheaper in minutes and far gentler on the sites you are scraping, which is also the responsible choice.
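A minimal sketch of the single-job approach follows, assuming a `fetch` callable that wraps the CRW request. The `scrape_politely` name, its parameters, and the injectable `sleep` are illustrative choices; the injection only exists so the pacing can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable


def scrape_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch URLs one at a time, pausing delay_s seconds between requests."""
    results: dict[str, str] = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_s)  # pause before every request except the first
        results[url] = fetch(url)
    return results


# Demo with a stub fetcher and a recorded sleep instead of a real one.
pauses: list[float] = []
pages = scrape_politely(
    ["https://example.com/pricing", "https://example.com/changelog"],
    fetch=lambda url: f"# stub for {url}",
    delay_s=2.0,
    sleep=pauses.append,
)
print(len(pages), pauses)  # 2 [2.0]
```

The delay adds wall-clock time to the run, so the value is a trade-off between CI minutes and load on the target site.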
Patterns and Gotchas
- Commit vs PR — committing straight to main is simpler; a PR gives you a human review gate for data diffs. The example uses a PR.
- Cron drift — GitHub schedules can be delayed under load. Do not assume exact-minute execution; design idempotent jobs.
- Rate-limit yourself — for many targets, add a small sleep between requests so you stay a polite client.
- Artifacts for large data — if snapshots get big, upload them as workflow artifacts instead of committing.
Why CRW for CI Scraping
- Tiny image — 8 MB pulls fast, keeping CI minutes low.
- Service-container friendly — stateless, instant health, no Redis or browser fleet to orchestrate.
- No lock-in — open-core Rust, small single binary, lower-latency, local-first, AGPL-3.0 + Managed Cloud. Swap to fastCRW cloud by changing api_url if you outgrow the runner.
A Diff-Aware Variant That Reports What Changed
Committing snapshots tells you that something changed via the git diff, but a richer workflow summarizes what changed directly in the PR body so a reviewer does not have to read raw markdown diffs. Extend the script to compute a per-target change summary:
import difflib

def summarize_change(old: str, new: str) -> str:
    if not old:
        return "new page captured"
    diff = list(difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm="", n=0))
    # Skip the two file-header lines ("---" / "+++") by position, so content
    # lines that themselves begin with "--" or "++" (for example a markdown
    # horizontal rule) are still counted.
    body = diff[2:]
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    return f"+{added} / -{removed} lines"

# in main(), accumulate a markdown summary
summary_lines: list[str] = []
# ... when a page changes:
#     summary_lines.append(f"- {url}: {summarize_change(old, md)}")
# write it for the PR step to consume
if summary_lines:
    pathlib.Path("CHANGE_SUMMARY.md").write_text("\n".join(summary_lines))
Feed CHANGE_SUMMARY.md into the body of the create-pull-request step, for example through the action's body-path input, which reads the PR body from a file. Now the PR says "pricing page: +4 / -2 lines, changelog: new page captured" at a glance, turning a noisy data commit into a reviewable signal. This is the difference between a snapshot dump nobody reads and a change feed a human can triage in seconds.
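To make the accumulation step concrete, here is a self-contained sketch that builds the summary text for two targets from in-memory strings. The `build_summary` helper and the sample page contents are made up for illustration; this version of `summarize_change` drops the two diff header lines by position before counting.

```python
import difflib


def summarize_change(old: str, new: str) -> str:
    """Describe a change as added/removed line counts."""
    if not old:
        return "new page captured"
    diff = list(difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm="", n=0))
    body = diff[2:]  # drop the "---" / "+++" header lines
    added = sum(1 for line in body if line.startswith("+"))
    removed = sum(1 for line in body if line.startswith("-"))
    return f"+{added} / -{removed} lines"


def build_summary(pages: list[tuple[str, str, str]]) -> str:
    """Turn (url, old, new) triples into one markdown bullet per changed page."""
    lines = [
        f"- {url}: {summarize_change(old, new)}"
        for url, old, new in pages
        if old.strip() != new.strip()  # same comparison scrape.py uses
    ]
    return "\n".join(lines)


summary = build_summary([
    ("https://example.com/pricing",
     "Basic: $10\nPro: $20\nEnterprise: call us",
     "Basic: $12\nPro: $20\nEnterprise: call us\nStudent: $5"),
    ("https://example.com/changelog", "", "v1.1 released"),
    ("https://example.com/about", "About us", "About us\n"),
])
print(summary)
# - https://example.com/pricing: +2 / -1 lines
# - https://example.com/changelog: new page captured
```

The unchanged third page produces no bullet, so the summary lists only what a reviewer needs to look at.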
Next Steps
- See Scheduled Crawls With Cron and CRW for the always-on variant
- Read Competitor Monitoring with CRW
Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.
