What We're Building
A news aggregator that monitors a list of source homepages, extracts the latest headlines as structured data, removes near-duplicate stories across sources, and produces a clean daily digest. RSS feeds are inconsistent and often missing, whereas almost every outlet has a homepage that can be scraped. CRW turns each homepage into structured records so you never write a brittle CSS selector per site.
Architecture
- Extract: CRW's /v1/extract pulls headlines and links from each source homepage with a JSON schema
- Store: SQLite keeps seen articles so each run reports only new items
- Dedupe: Title similarity collapses the same story across outlets
- Digest: A markdown digest, optionally summarized by an LLM
Prerequisites
- CRW running locally: docker run -p 3000:3000 ghcr.io/us/crw:latest
- Python 3.10+ and an OpenAI API key (for the extract step and optional summaries)
pip install firecrawl-py
Step 1: SDK Setup
from firecrawl import FirecrawlApp
# Self-hosted CRW
app = FirecrawlApp(api_key="fc-YOUR-KEY", api_url="http://localhost:3000")
# Or fastCRW cloud: api_url="https://api.fastcrw.com"
Step 2: Define the Headline Schema
One schema works across CNN, the BBC, TechCrunch, or any niche blog — the LLM reads the page semantically instead of matching HTML structure:
HEADLINE_SCHEMA = {
    "type": "object",
    "properties": {
        "articles": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string", "description": "The headline text"},
                    "url": {"type": "string", "description": "Absolute URL to the article"},
                    "summary": {"type": "string", "description": "One-line teaser if present"},
                },
                "required": ["title", "url"],
            },
        }
    },
    "required": ["articles"],
}
Step 3: Extract Headlines From a Source
from urllib.parse import urljoin

def fetch_headlines(homepage: str) -> list[dict]:
    result = app.extract(
        urls=[homepage],
        params={
            "prompt": "Extract the latest news article headlines and their links from this homepage. Ignore ads, navigation, and footer links.",
            "schema": HEADLINE_SCHEMA,
        },
    )
    if not result or "data" not in result:
        return []
    out = []
    for a in result["data"].get("articles", []):
        url = urljoin(homepage, a["url"])  # resolve relative links
        out.append({"title": a["title"].strip(), "url": url,
                    "summary": a.get("summary", ""), "source": homepage})
    return out
Step 4: Store Seen Articles
import sqlite3, hashlib
from datetime import datetime

DB = "news.db"

def init_db():
    with sqlite3.connect(DB) as c:
        c.execute("""CREATE TABLE IF NOT EXISTS articles (
            id TEXT PRIMARY KEY, title TEXT, url TEXT, source TEXT,
            seen_at TEXT)""")

def article_id(url: str) -> str:
    # Stable 16-hex-character id derived from the article URL
    return hashlib.sha256(url.encode()).hexdigest()[:16]

def is_new(url: str) -> bool:
    with sqlite3.connect(DB) as c:
        row = c.execute("SELECT 1 FROM articles WHERE id=?",
                        (article_id(url),)).fetchone()
    return row is None

def mark_seen(a: dict):
    with sqlite3.connect(DB) as c:
        c.execute("INSERT OR IGNORE INTO articles VALUES (?,?,?,?,?)",
                  (article_id(a["url"]), a["title"], a["url"],
                   a["source"], datetime.now().isoformat()))
Step 5: Dedupe Near-Duplicate Stories
The same event gets reported by many outlets with slightly different titles. A normalized token-overlap check collapses them:
import re

def normalize(title: str) -> set[str]:
    words = re.findall(r"[a-z]+", title.lower())
    stop = {"the", "a", "an", "to", "of", "in", "on", "for", "and", "is", "as"}
    return {w for w in words if w not in stop and len(w) > 2}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedupe(articles: list[dict], threshold: float = 0.6) -> list[dict]:
    kept: list[dict] = []
    sigs: list[set] = []
    for art in articles:
        sig = normalize(art["title"])
        if any(jaccard(sig, s) >= threshold for s in sigs):
            continue  # too similar to a story we already kept
        kept.append(art)
        sigs.append(sig)
    return kept
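To get a feel for where the 0.6 threshold sits, the sketch below scores three pairs of headlines. It repeats the `normalize` and `jaccard` helpers from above only so that it runs on its own, and the headlines are invented for illustration.

```python
import re

STOP = {"the", "a", "an", "to", "of", "in", "on", "for", "and", "is", "as"}

def normalize(title: str) -> set[str]:
    # Lowercase, keep alphabetic tokens, drop stop words and very short tokens
    words = re.findall(r"[a-z]+", title.lower())
    return {w for w in words if w not in STOP and len(w) > 2}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented example headlines
same_a = "Apple unveils new iPhone with faster chip"
same_b = "Apple announces new iPhone with a faster chip"
paraphrase = "Apple's latest iPhone gets a speed boost"
other = "Senate passes budget bill after long debate"

for label, title in [("near-identical", same_b),
                     ("paraphrase", paraphrase),
                     ("unrelated", other)]:
    score = jaccard(normalize(same_a), normalize(title))
    verdict = "duplicate" if score >= 0.6 else "kept"
    print(f"{label:15s} {score:.2f} {verdict}")
```

The near-identical pair shares six of eight distinct tokens (0.75) and collapses, while the unrelated pair scores 0.0. The paraphrase shares only two tokens and survives as a separate story, which shows the limit of token overlap: it catches lightly reworded titles, not rewrites. Lowering the threshold catches more of those at the cost of merging unrelated stories that share a few common words.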
Step 6: Build the Digest
def build_digest(sources: list[str]) -> str:
    init_db()
    fresh: list[dict] = []
    for src in sources:
        for art in fetch_headlines(src):
            if is_new(art["url"]):
                fresh.append(art)
    fresh = dedupe(fresh)
    for art in fresh:
        mark_seen(art)
    lines = [f"# News Digest — {datetime.now():%Y-%m-%d %H:%M}",
             f"\n{len(fresh)} new stories\n"]
    for art in fresh:
        host = art["source"].split("/")[2]  # "https://host/..." -> "host"
        lines.append(f"- [{art['title']}]({art['url']}) — *{host}*")
        if art["summary"]:
            lines.append(f"  > {art['summary']}")
    return "\n".join(lines)

if __name__ == "__main__":
    SOURCES = [
        "https://techcrunch.com",
        "https://www.theverge.com",
        "https://arstechnica.com",
    ]
    digest = build_digest(SOURCES)
    print(digest)
    with open(f"digest-{datetime.now():%Y%m%d}.md", "w") as f:
        f.write(digest)
Optional: LLM Summaries
To turn raw headlines into a paragraph briefing, scrape each new article and summarize it:
def summarize(article_url: str) -> str:
    page = app.scrape_url(article_url, params={"formats": ["markdown"],
                                               "onlyMainContent": True})
    md = (page or {}).get("markdown", "")[:6000]
    # send md to your LLM of choice for a 2-sentence summary
    return md[:280] + "..."  # placeholder; swap in your summarizer
Scheduling
Run it from cron — CRW's low idle memory footprint means the aggregator and CRW can share a $5 VPS:
# crontab -e
0 7 * * * cd /opt/news && /usr/bin/python3 aggregator.py >> cron.log 2>&1
Handling Source Diversity Without Per-Site Code
The reason this aggregator stays small is that it never models any individual site. A traditional headline scraper needs a parser per source: one for the publication that wraps stories in <article class="card">, another for the one that uses a JSON blob in a <script> tag, another for the SPA that renders client-side. Every redesign breaks one of them, and you find out when the digest goes silent. The schema approach delegates "what is a headline on this page" to the model, so the same fetch_headlines function works on a WordPress blog, a bespoke React news app, and a wire-service front page. When you add a source, you add a URL to a list — not a module.
There is a real tradeoff to acknowledge. LLM extraction costs more per page than a hand-tuned selector and can occasionally miss an item or pull a promoted "sponsored" story. Mitigate this with the schema itself: a clear prompt ("ignore ads, navigation, and footer links") and a description on each field steer the model. For sources you depend on heavily, add a post-extraction sanity check — for example, drop entries whose title is shorter than four words or whose URL host does not match the source domain. These guards are generic, not per-site, so they do not reintroduce the maintenance burden you were trying to escape.
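The two guards mentioned above can be sketched as a single generic filter. This is a minimal illustration, not part of CRW: `bare_host` and `sanity_filter` are names made up here, and the host comparison is deliberately rough (it strips a leading `www.` and accepts subdomains of the source rather than consulting a public-suffix list).

```python
from urllib.parse import urlparse

def bare_host(url: str) -> str:
    """Lowercased host without a leading 'www.' (a rough comparison key)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def sanity_filter(articles: list[dict], min_words: int = 4) -> list[dict]:
    """Drop entries with very short titles or links that leave the source site."""
    kept = []
    for art in articles:
        # Guard 1: titles under min_words are usually nav items or calls to action
        if len(art["title"].split()) < min_words:
            continue
        # Guard 2: the article host must equal the source host or be a subdomain of it
        src = bare_host(art["source"])
        host = bare_host(art["url"])
        if not src or not (host == src or host.endswith("." + src)):
            continue
        kept.append(art)
    return kept
```

It expects the same dict shape that `fetch_headlines` returns, so it can run between extraction and `dedupe`. One thing to watch: an outlet that links to stories hosted on a different domain would have those stories dropped by the host guard, so that check may need to be relaxed for such sources.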
Politeness, Caching, and Conditional Refresh
A homepage changes a handful of times a day, so re-extracting it every five minutes is wasteful and impolite. Two cheap improvements make the aggregator a good web citizen. First, cache the raw scrape briefly and only re-run extraction when the page body actually changed — hash the scraped markdown and skip the LLM call on an unchanged hash. Second, stagger sources so you never fire all requests in the same instant:
import time, random, hashlib

# In-memory cache: it lasts only for the life of the process.
_page_cache: dict[str, str] = {}

def homepage_changed(url: str) -> bool:
    doc = app.scrape_url(url, params={"formats": ["markdown"],
                                      "onlyMainContent": True})
    md = (doc or {}).get("markdown", "")
    h = hashlib.sha256(md.encode()).hexdigest()
    changed = _page_cache.get(url) != h
    _page_cache[url] = h
    return changed

def build_digest_polite(sources: list[str]) -> str:
    init_db()  # make sure the articles table exists before is_new() queries it
    fresh = []
    for src in sources:
        if not homepage_changed(src):
            continue  # nothing new, skip the LLM call
        for art in fetch_headlines(src):
            if is_new(art["url"]):
                fresh.append(art)
        time.sleep(random.uniform(1, 4))  # spread requests out
    fresh = dedupe(fresh)
    for a in fresh:
        mark_seen(a)
    return f"{len(fresh)} new stories after change-detection + dedupe"
This pattern cuts extraction cost dramatically on a fixed source list because most polls find an unchanged homepage and short-circuit before the expensive step. It also keeps your request rate modest, which matters when you are scraping publications that watch their traffic.
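One caveat if the script is launched from cron: `_page_cache` is a plain dict, so it starts empty on every invocation and every homepage looks changed. A minimal sketch of keeping the hashes in SQLite instead is shown below. The `page_hashes` table and the `content_changed` helper are names invented for this example, and the function takes the already-scraped markdown so that it does not depend on the scraping client.

```python
import hashlib
import sqlite3

def content_changed(conn: sqlite3.Connection, url: str, markdown: str) -> bool:
    """Return True if markdown differs from the hash last stored for url."""
    conn.execute("""CREATE TABLE IF NOT EXISTS page_hashes (
        url TEXT PRIMARY KEY, hash TEXT)""")
    new_hash = hashlib.sha256(markdown.encode()).hexdigest()
    row = conn.execute("SELECT hash FROM page_hashes WHERE url=?",
                       (url,)).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # same content as the previous run
    # First sighting or changed content: remember the new hash
    conn.execute("INSERT OR REPLACE INTO page_hashes VALUES (?, ?)",
                 (url, new_hash))
    conn.commit()
    return True
```

Inside `homepage_changed`, the dict lookup could be swapped for a call such as `content_changed(sqlite3.connect(DB), url, md)`, which would let the short-circuit survive process restarts. Homepages that embed timestamps or rotating widgets will still hash differently on each poll, so the saving depends on how stable the scraped markdown is.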
Turning the Digest Into a Feed
Once you have structured, deduped articles in SQLite, the digest is just one possible view. The same table powers an RSS/Atom feed (so readers consume it in their existing reader), a daily email, or a Slack post. Because every record already has a stable id, a title, a URL, and a source, generating any of these is a small templating step with no additional scraping. The aggregator's value is the clean structured store; the output format is interchangeable, and you can add new outputs without touching the collection logic.
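As one illustration of that templating step, the sketch below renders the `articles` table from Step 4 as an RSS 2.0 document using only the standard library. The `build_rss` function, the channel title, and the `https://example.com/digest` link are placeholders chosen for this example, and `seen_at` is assumed to hold the naive local-time ISO string that `mark_seen` writes.

```python
import sqlite3
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime

def build_rss(conn: sqlite3.Connection, title: str = "News Digest",
              link: str = "https://example.com/digest", limit: int = 50) -> str:
    """Render the most recently seen articles as an RSS 2.0 string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Deduplicated headlines"

    rows = conn.execute(
        "SELECT id, title, url, seen_at FROM articles "
        "ORDER BY seen_at DESC LIMIT ?", (limit,))
    for art_id, art_title, url, seen_at in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = art_title
        ET.SubElement(item, "link").text = url
        # The stored id is a hash, not a URL, so mark the guid as non-permalink
        ET.SubElement(item, "guid", isPermaLink="false").text = art_id
        seen = datetime.fromisoformat(seen_at)
        if seen.tzinfo is None:
            seen = seen.astimezone()  # treat naive timestamps as local time
        ET.SubElement(item, "pubDate").text = format_datetime(seen)
    return ET.tostring(rss, encoding="unicode")
```

Calling `build_rss(sqlite3.connect("news.db"))` after a digest run and writing the result to a file served over HTTP would be enough for most feed readers. ElementTree escapes characters such as `&` in titles, which is the main reason to prefer it over string formatting here.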
Why CRW
- Schema extraction beats RSS — works on any homepage, no feed required, no per-site selectors.
- Fast enough for many sources — open-core Rust, small single binary, lower-latency, local-first.
- No lock-in — AGPL-3.0 self-host free, or managed cloud with one URL change.
Next Steps
- See Build an AI Price Tracker for the scheduled-monitoring pattern
- Read RAG Pipeline with CRW to make the digest queryable
Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.
