Why scrape homepages instead of using RSS feeds?

Many news sites have no RSS feed, truncated feeds, or feeds that omit sections. Scraping the homepage with CRW's schema extraction works uniformly across every source and captures exactly the headlines a reader sees, with no per-site feed parsing.

How does the dedupe step avoid collapsing different stories?

It compares normalized title token sets with Jaccard similarity and only merges above a 0.6 threshold. Tune the threshold up if distinct stories get merged, or down if obvious duplicates slip through; 0.6 is a reasonable starting point for headline-length text.

Build a News Aggregator in Python with CRW (2026): Crawl, Dedupe, Summarize

What We're Building

A news aggregator that monitors a list of source homepages, extracts the latest headlines as structured data, removes near-duplicate stories across sources, and produces a clean daily digest. RSS feeds are inconsistent and often missing; scraping the homepage works everywhere. CRW turns each homepage into structured records so you never write a brittle CSS selector per site.

Architecture

Extract — CRW's /v1/extract pulls headlines + links from each source homepage with a JSON schema
Store — SQLite keeps seen articles so each run reports only new items
Dedupe — Title similarity collapses the same story across outlets
Digest — A markdown digest, optionally summarized by an LLM

Prerequisites

CRW running locally: docker run -p 3000:3000 ghcr.io/us/crw:latest
Python 3.10+ and an OpenAI API key (for the extract step and optional summaries)

pip install firecrawl-py

Step 1: SDK Setup

from firecrawl import FirecrawlApp

# Self-hosted CRW
app = FirecrawlApp(api_key="fc-YOUR-KEY", api_url="http://localhost:3000")
# Or fastCRW cloud: api_url="https://api.fastcrw.com"

Step 2: Define the Headline Schema

One schema works across CNN, the BBC, TechCrunch, or any niche blog — the LLM reads the page semantically instead of matching HTML structure:

HEADLINE_SCHEMA = {
    "type": "object",
    "properties": {
        "articles": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string", "description": "The headline text"},
                    "url": {"type": "string", "description": "Absolute URL to the article"},
                    "summary": {"type": "string", "description": "One-line teaser if present"},
                },
                "required": ["title", "url"],
            },
        }
    },
    "required": ["articles"],
}

Step 3: Extract Headlines From a Source

from urllib.parse import urljoin


def fetch_headlines(homepage: str) -> list[dict]:
    result = app.extract(
        urls=[homepage],
        params={
            "prompt": "Extract the latest news article headlines and their links from this homepage. Ignore ads, navigation, and footer links.",
            "schema": HEADLINE_SCHEMA,
        },
    )
    if not result or "data" not in result:
        return []

    out = []
    for a in result["data"].get("articles", []):
        url = urljoin(homepage, a["url"])  # resolve relative links
        out.append({"title": a["title"].strip(), "url": url,
                     "summary": a.get("summary", ""), "source": homepage})
    return out

Step 4: Store Seen Articles

import sqlite3, hashlib
from datetime import datetime

DB = "news.db"


def init_db():
    with sqlite3.connect(DB) as c:
        c.execute("""CREATE TABLE IF NOT EXISTS articles (
            id TEXT PRIMARY KEY, title TEXT, url TEXT, source TEXT,
            seen_at TEXT)""")


def article_id(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()[:16]


def is_new(url: str) -> bool:
    with sqlite3.connect(DB) as c:
        row = c.execute("SELECT 1 FROM articles WHERE id=?",
                         (article_id(url),)).fetchone()
        return row is None


def mark_seen(a: dict):
    with sqlite3.connect(DB) as c:
        c.execute("INSERT OR IGNORE INTO articles VALUES (?,?,?,?,?)",
                  (article_id(a["url"]), a["title"], a["url"],
                   a["source"], datetime.now().isoformat()))

Step 5: Dedupe Near-Duplicate Stories

The same event gets reported by many outlets with slightly different titles. A normalized token-overlap check collapses them:

import re


def normalize(title: str) -> set[str]:
    words = re.findall(r"[a-z]+", title.lower())
    stop = {"the", "a", "an", "to", "of", "in", "on", "for", "and", "is", "as"}
    return {w for w in words if w not in stop and len(w) > 2}


def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe(articles: list[dict], threshold: float = 0.6) -> list[dict]:
    kept: list[dict] = []
    sigs: list[set] = []
    for art in articles:
        sig = normalize(art["title"])
        if any(jaccard(sig, s) >= threshold for s in sigs):
            continue
        kept.append(art)
        sigs.append(sig)
    return kept
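To pick a threshold it helps to look at the actual pair scores for a handful of headlines. The sketch below repeats compact copies of normalize and jaccard so it runs on its own; pair_scores and the SAMPLE headlines are illustrative assumptions, not part of the aggregator.

```python
import re
from itertools import combinations

STOP = {"the", "a", "an", "to", "of", "in", "on", "for", "and", "is", "as"}


def normalize(title: str) -> set[str]:
    # Same normalization as Step 5: lowercase words minus stopwords and short tokens
    return {w for w in re.findall(r"[a-z]+", title.lower())
            if w not in STOP and len(w) > 2}


def jaccard(a: set, b: set) -> float:
    # Size of the overlap divided by size of the union; 0.0 if either set is empty
    return len(a & b) / len(a | b) if a and b else 0.0


def pair_scores(titles: list[str]) -> list[tuple[float, str, str]]:
    """Score every pair of titles, highest similarity first (hypothetical helper)."""
    scored = [(jaccard(normalize(x), normalize(y)), x, y)
              for x, y in combinations(titles, 2)]
    return sorted(scored, key=lambda t: t[0], reverse=True)


# Made-up headlines, for illustration only
SAMPLE = [
    "Apple unveils new iPhone with faster chip",
    "Apple unveils new iPhone featuring faster chip",
    "Fed holds interest rates steady",
]

for score, x, y in pair_scores(SAMPLE):
    print(f"{score:.2f}  {x!r} vs {y!r}")
```

The two Apple headlines share six of eight distinct tokens, so they score 0.75 and would merge at the default 0.6 threshold. The Fed headline shares no tokens with either and scores 0.00.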

Step 6: Build the Digest

def build_digest(sources: list[str]) -> str:
    init_db()
    fresh: list[dict] = []
    for src in sources:
        for art in fetch_headlines(src):
            if is_new(art["url"]):
                fresh.append(art)

    fresh = dedupe(fresh)
    for art in fresh:
        mark_seen(art)

    lines = [f"# News Digest — {datetime.now():%Y-%m-%d %H:%M}",
             f"\n{len(fresh)} new stories\n"]
    for art in fresh:
        host = art["source"].split("/")[2]
        lines.append(f"- [{art['title']}]({art['url']}) — *{host}*")
        if art["summary"]:
            lines.append(f"  > {art['summary']}")
    return "\n".join(lines)


if __name__ == "__main__":
    SOURCES = [
        "https://techcrunch.com",
        "https://www.theverge.com",
        "https://arstechnica.com",
    ]
    digest = build_digest(SOURCES)
    print(digest)
    with open(f"digest-{datetime.now():%Y%m%d}.md", "w", encoding="utf-8") as f:
        f.write(digest)

Optional: LLM Summaries

To turn raw headlines into a paragraph briefing, scrape each new article and summarize it:

def summarize(article_url: str) -> str:
    page = app.scrape_url(article_url, params={"formats": ["markdown"],
                                               "onlyMainContent": True})
    md = (page or {}).get("markdown", "")[:6000]
    # send md to your LLM of choice for a 2-sentence summary
    return md[:280] + "..."  # placeholder; swap in your summarizer

Scheduling

Run it from cron — CRW's low idle memory footprint means the aggregator and CRW can share a $5 VPS:

# crontab -e
0 7 * * *  cd /opt/news && /usr/bin/python3 aggregator.py >> cron.log 2>&1

Handling Source Diversity Without Per-Site Code

The reason this aggregator stays small is that it never models any individual site. A traditional headline scraper needs a parser per source: one for the publication that wraps stories in <article class="card">, another for the one that uses a JSON blob in a <script> tag, another for the SPA that renders client-side. Every redesign breaks one of them, and you find out when the digest goes silent. The schema approach delegates "what is a headline on this page" to the model, so the same fetch_headlines function works on a WordPress blog, a bespoke React news app, and a wire-service front page. When you add a source, you add a URL to a list — not a module.

There is a real tradeoff to acknowledge. LLM extraction costs more per page than a hand-tuned selector and can occasionally miss an item or pull a promoted "sponsored" story. Mitigate this with the schema itself: a clear prompt ("ignore ads, navigation, and footer links") and a description on each field steer the model. For sources you depend on heavily, add a post-extraction sanity check — for example, drop entries whose title is shorter than four words or whose URL host does not match the source domain. These guards are generic, not per-site, so they do not reintroduce the maintenance burden you were trying to escape.
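The two guards described above can be sketched as a single generic filter. This is a minimal version under stated assumptions: the names sane, filter_sane, and _base_host are hypothetical, and "matches the source domain" is read as "same host or a subdomain of it, ignoring a leading www.".

```python
from urllib.parse import urlparse


def _base_host(url: str) -> str:
    """Lowercased hostname without a leading 'www.' (empty string if none)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def sane(article: dict, min_words: int = 4) -> bool:
    """Generic guard: long-enough title and a link that stays on the source's domain."""
    if len(article["title"].split()) < min_words:
        return False
    src = _base_host(article["source"])
    dest = _base_host(article["url"])
    # Accept the source host itself or any subdomain of it
    return bool(src) and (dest == src or dest.endswith("." + src))


def filter_sane(articles: list[dict]) -> list[dict]:
    """Keep only the records that pass the guard."""
    return [a for a in articles if sane(a)]
```

Calling filter_sane on the output of fetch_headlines before the dedupe step would apply both checks. The host rule is deliberately strict, so an outlet that links its stories from a different domain than its homepage would need a looser comparison.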

Politeness, Caching, and Conditional Refresh

A homepage changes a handful of times a day, so re-extracting it every five minutes is wasteful and impolite. Two cheap improvements make the aggregator a good web citizen. First, cache the raw scrape briefly and only re-run extraction when the page body actually changed — hash the scraped markdown and skip the LLM call on an unchanged hash. Second, stagger sources so you never fire all requests in the same instant:

import time, random, hashlib

_page_cache: dict[str, str] = {}


def homepage_changed(url: str) -> bool:
    doc = app.scrape_url(url, params={"formats": ["markdown"],
                                      "onlyMainContent": True})
    md = (doc or {}).get("markdown", "")
    h = hashlib.sha256(md.encode()).hexdigest()
    changed = _page_cache.get(url) != h
    _page_cache[url] = h
    return changed


def build_digest_polite(sources: list[str]) -> str:
    init_db()                              # is_new() needs the articles table
    fresh = []
    for i, src in enumerate(sources):
        if i:
            time.sleep(random.uniform(1, 4))   # spread requests out, even after a skip
        if not homepage_changed(src):
            continue                       # nothing new, skip the LLM call
        for art in fetch_headlines(src):
            if is_new(art["url"]):
                fresh.append(art)
    fresh = dedupe(fresh)
    for a in fresh:
        mark_seen(a)
    return f"{len(fresh)} new stories after change-detection + dedupe"

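This pattern can cut extraction cost substantially on a fixed source list: whenever a poll finds an unchanged homepage, it short-circuits before the expensive LLM step. How often that happens depends on the source, since any change in the scraped markdown (a rotating teaser, a relative timestamp) produces a new hash. Note that _page_cache lives in process memory, so the saving only applies within a long-running process that polls repeatedly. The pattern also keeps your request rate modest, which matters when you are scraping publications that watch their traffic.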
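When the aggregator is launched fresh by cron, an in-memory dict starts empty on every run, so every homepage looks changed. One way around that is to keep the last hash per URL in SQLite alongside the articles. The sketch below is an assumption-laden illustration: the page_hashes table and both function names are hypothetical, and the caller is expected to pass in the markdown it already scraped.

```python
import hashlib
import sqlite3


def init_hash_table(db_path: str) -> None:
    """Create the (hypothetical) page_hashes table if it does not exist."""
    with sqlite3.connect(db_path) as c:
        c.execute("""CREATE TABLE IF NOT EXISTS page_hashes (
            url TEXT PRIMARY KEY, hash TEXT)""")


def changed_since_last_run(db_path: str, url: str, markdown: str) -> bool:
    """Compare the markdown's hash with the stored one, then store the new hash."""
    h = hashlib.sha256(markdown.encode()).hexdigest()
    with sqlite3.connect(db_path) as c:
        row = c.execute("SELECT hash FROM page_hashes WHERE url=?",
                        (url,)).fetchone()
        # Upsert so the next run compares against this poll
        c.execute("INSERT OR REPLACE INTO page_hashes VALUES (?,?)", (url, h))
    # Never seen before counts as changed
    return row is None or row[0] != h
```

Inside homepage_changed, the dict lookup and assignment could be swapped for a call such as changed_since_last_run(DB, url, md), with init_hash_table(DB) called once at startup.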

Turning the Digest Into a Feed

Once you have structured, deduped articles in SQLite, the digest is just one possible view. The same table powers an RSS/Atom feed (so readers consume it in their existing reader), a daily email, or a Slack post. Because every record already has a stable id, a title, a URL, and a source, generating any of these is a small templating step with no additional scraping. The aggregator's value is the clean structured store; the output format is interchangeable, and you can add new outputs without touching the collection logic.
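As an illustration of how small that templating step can be, here is a sketch that renders the articles table from Step 4 as an RSS 2.0 string using only the standard library. The function name rss_from_db, the channel title and description, and the https://example.com/digest link are placeholders, not anything CRW provides.

```python
import sqlite3
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime


def rss_from_db(db_path: str, title: str = "News Digest", limit: int = 50) -> str:
    """Render the newest rows of the articles table as an RSS 2.0 document."""
    with sqlite3.connect(db_path) as c:
        rows = c.execute(
            "SELECT id, title, url, seen_at FROM articles "
            "ORDER BY seen_at DESC LIMIT ?", (limit,)).fetchall()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = "https://example.com/digest"  # placeholder
    ET.SubElement(channel, "description").text = "Deduplicated headlines"

    for id_, art_title, url, seen_at in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = art_title
        ET.SubElement(item, "link").text = url
        # The stable article id doubles as the feed GUID
        ET.SubElement(item, "guid", isPermaLink="false").text = id_
        # seen_at is an ISO string; naive timestamps come out with a -0000 offset
        ET.SubElement(item, "pubDate").text = format_datetime(
            datetime.fromisoformat(seen_at))
    # ElementTree escapes &, <, and > in text for us
    return ET.tostring(rss, encoding="unicode")
```

Writing the returned string to a file served over HTTP is enough for most feed readers. Because the function only reads from SQLite, it can run on its own schedule without triggering any scraping.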

Why CRW

Schema extraction beats RSS — works on any homepage, no feed required, no per-site selectors.
Fast enough for many sources — open-core Rust, small single binary, lower-latency, local-first.
No lock-in — AGPL-3.0 self-host free, or managed cloud with one URL change.

Next Steps

See Build an AI Price Tracker for the scheduled-monitoring pattern
Read RAG Pipeline with CRW to make the digest queryable

Self-host CRW from GitHub for free, or use fastCRW for managed cloud scraping.