
Web Scraping for News Aggregation

Build a news aggregation pipeline with fastCRW: discover URLs across news sites, scrape full articles, deduplicate content, and summarize with LLM extraction.

Published: May 12, 2026 · Updated: May 12, 2026 · Category: Use cases
Verdict

fastCRW enables cost-effective, real-time news aggregation at scale. Combine scheduled scrapes with deduplication and LLM summarization to build a comprehensive news platform without maintaining complex crawler infrastructure.

  • Discover news URLs via RSS, sitemaps, and the `/v1/map` endpoint
  • Scrape full articles and extract structured metadata with `/v1/scrape`
  • Deduplicate content and summarize with on-demand LLM extraction

Why News Aggregation Needs Web Scraping

The news cycle moves faster than manual curation can keep up with: hundreds of sources publish thousands of articles daily. RSS feeds, where they exist, often contain only headlines and summaries, and paywalled or premium sites strip content from their feeds entirely.

Automated scraping enables:

  • Real-time coverage across dozens of sources without manual work
  • Full-text extraction including body content, author, publication date, and images
  • Consistent formatting across diverse source sites (all converted to clean markdown)
  • Deduplication to surface unique stories, not duplicates across feeds
  • Structured metadata automatically extracted for search, filtering, and ranking

Without scraping, you're limited to RSS feeds (incomplete and inconsistent) or manually building integrations per source (not scalable).

Where fastCRW Helps

| Aggregation need | fastCRW role |
| --- | --- |
| URL discovery | `/v1/map` finds all article URLs on news domains using sitemaps and crawl patterns |
| Full-text extraction | `/v1/scrape` returns clean markdown of articles with all metadata |
| Scheduled collection | Recurring scheduled scrapes catch new articles daily or hourly |
| Deduplication | Consistent output format makes it easy to detect duplicate content via hashing |
| Extraction & summarization | LLM extraction pulls structured fields (headline, summary, author) automatically |
| Search integration | `/v1/search` finds news about specific topics across the web, then scrapes top results |
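
As a concrete example of the search-integration row, a topic-driven pipeline might call `/v1/search` and feed the results into `/v1/scrape`. This is a minimal sketch: the `query`/`limit` request fields and the `results` list in the response are assumptions, so check the API reference for the actual contract.

import requests

CRW_API_KEY = "your-api-key"
CRW_BASE_URL = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {CRW_API_KEY}"}

def search_and_scrape(topic: str, limit: int = 10) -> list[dict]:
    """Find recent news about a topic, then scrape the top results."""
    # Assumed request shape for /v1/search
    search = requests.post(
        f"{CRW_BASE_URL}/search",
        json={"query": topic, "limit": limit},
        headers=HEADERS,
    )
    search.raise_for_status()
    # Assumed response shape: {"results": [{"url": ...}, ...]}
    urls = [r["url"] for r in search.json().get("results", [])]

    articles = []
    for url in urls:
        scrape = requests.post(
            f"{CRW_BASE_URL}/scrape",
            json={"url": url, "format": "markdown"},
            headers=HEADERS,
        )
        if scrape.ok:
            articles.append(scrape.json())
    return articles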

Typical News Aggregation Flow

  1. Identify target sources: Choose 10–50 news domains relevant to your audience (technology, finance, sports, world news, etc.)
  2. Discover URLs: Use /v1/map on each domain to extract all article URLs. Most news sites have sitemaps or predictable URL patterns.
  3. Schedule scrapes: Run /v1/scrape on discovered URLs daily (or more frequently for breaking news) to extract full article text.
  4. Clean and deduplicate: Compare article content via hash or semantic similarity. Remove duplicates that appear across multiple sources.
  5. Extract metadata: Use LLM extraction to automatically fill headline, author, publish date, and summary fields.
  6. Store and index: Insert cleaned articles into your database with source attribution, timestamps, and topic tags.
  7. Serve to users: Build a feed UI that surfaces new stories, trending topics, and personalized recommendations.
  8. Monitor and alert: Detect keyword mentions across sources and trigger alerts for breaking news or trending topics.

Good Fits for News Aggregation

  • News platforms aggregating stories across 20+ sources for a unified feed
  • Industry monitoring dashboards tracking sector-specific news (finance, tech, healthcare)
  • Research teams building topic-specific corpora for analysis and trend detection
  • Alert services that notify users when specific keywords appear in the news
  • AI-powered news assistants that summarize trending topics or specific industries
  • Social media managers who need to identify trending news for commentary or engagement
  • Competitive intelligence teams monitoring competitor mentions and industry developments

Architecture: Building a Scalable News Aggregation Pipeline

A production news aggregation pipeline has several layers:

1. URL Discovery Layer: Use `/v1/map` to crawl each news domain's sitemap and discover all article URLs. Most news sites update sitemaps automatically, so you can re-run discovery weekly to catch new sections or archived content.

2. Scraping Layer: Schedule `/v1/scrape` requests on discovered URLs using your application server or a job queue. Prioritize high-traffic sources (BBC, Reuters, NY Times) for frequent scrapes, and scrape niche sources less often. Use fastCRW's scheduling features to distribute load.

3. Deduplication Layer: After scraping, compute content hashes of the article text. Store the hashes in your database; if a new scrape returns a known hash, skip processing. This prevents the same story (published by multiple outlets) from cluttering your feed.
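
Exact hashes only catch byte-identical text, so it helps to normalize the markdown before hashing. A minimal sketch; the normalization rules here are illustrative, not part of the fastCRW API:

import hashlib
import re

def normalized_hash(markdown: str) -> str:
    """Hash article text after normalization, so trivial formatting
    differences (whitespace, case, punctuation) don't defeat dedup."""
    text = markdown.lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)   # drop punctuation and markup
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return hashlib.sha256(text.encode()).hexdigest()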

4. Extraction Layer: Use LLM extraction (Claude via fastCRW) to pull structured data: headline, author, publish date, summary, and category tags. This ensures consistent metadata across sources with different HTML structures.

5. Storage & Search Layer: Insert deduplicated, extracted articles into your database (PostgreSQL, MongoDB) with full-text search indexing. Tag articles by source, topic, and keywords.
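
For the PostgreSQL route, a possible schema with a full-text index might look like this, sketched via psycopg2. The table and column names are illustrative, not prescribed by fastCRW:

import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS articles (
    id          SERIAL PRIMARY KEY,
    url         TEXT UNIQUE NOT NULL,
    source      TEXT NOT NULL,
    headline    TEXT,
    author      TEXT,
    published   TIMESTAMPTZ,
    scraped_at  TIMESTAMPTZ DEFAULT now(),
    body        TEXT,
    topics      TEXT[]
);
-- Full-text index over headline + body for keyword lookups
CREATE INDEX IF NOT EXISTS articles_fts ON articles
    USING GIN (to_tsvector('english',
        coalesce(headline, '') || ' ' || coalesce(body, '')));
"""

def init_schema(dsn: str) -> None:
    """Create the articles table and search index if they don't exist."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DDL)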

6. Feed & API Layer: Serve articles via a REST or GraphQL API. Implement pagination, filtering by source/topic/date, and sorting by recency or trending score.

7. Alerting & Monitoring: Set up webhooks or cron jobs to detect breaking news (rapid article-count spikes), keyword mentions, or sentiment shifts, and send notifications to users.
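
A minimal spike detector over already-extracted articles (flat dicts with `published_date` and `category` fields) might look like the sketch below; the 30-minute window and 5-article threshold are arbitrary starting points to tune against your sources.

from collections import Counter
from datetime import datetime, timedelta, timezone

def detect_spikes(articles: list[dict],
                  window_minutes: int = 30,
                  threshold: int = 5) -> list[str]:
    """Flag categories whose article volume in the recent window
    exceeds a threshold -- a crude breaking-news signal."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=window_minutes)
    recent = Counter()
    for article in articles:
        published = article.get("published_date")
        # Assumes ISO 8601 timestamps with timezone offsets
        if published and datetime.fromisoformat(published) >= cutoff:
            recent[article.get("category", "uncategorized")] += 1
    return [category for category, count in recent.items() if count >= threshold]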

Implementation Walkthrough: News Aggregation Pipeline

Here's a working Python implementation that scrapes news articles, deduplicates them, and extracts metadata:

import requests
import hashlib
from datetime import datetime, timezone
from typing import Optional

# fastCRW API configuration
CRW_API_KEY = "your-api-key"
CRW_BASE_URL = "https://api.fastcrw.com/v1"

def map_news_domain(domain: str) -> list[str]:
    """Discover all article URLs on a news domain using /v1/map."""
    payload = {
        "url": domain,
        "useSitemap": True,
        "maxDepth": 2
    }
    
    response = requests.post(
        f"{CRW_BASE_URL}/map",
        json=payload,
        headers={"Authorization": f"Bearer {CRW_API_KEY}"}
    )
    
    if response.status_code == 200:
        data = response.json()
        return data.get("urls", [])
    else:
        print(f"Error mapping {domain}: {response.status_code}")
        return []

def scrape_article(url: str, extraction_schema: Optional[dict] = None) -> dict:
    """Scrape a news article and extract metadata using /v1/scrape."""
    payload = {
        "url": url,
        "format": "markdown",
        "extraction": {
            "schema": extraction_schema or {
                "type": "object",
                "properties": {
                    "headline": {"type": "string"},
                    "author": {"type": "string"},
                    "published_date": {"type": "string"},
                    "summary": {"type": "string"},
                    "category": {"type": "string"}
                }
            }
        }
    }
    
    response = requests.post(
        f"{CRW_BASE_URL}/scrape",
        json=payload,
        headers={"Authorization": f"Bearer {CRW_API_KEY}"}
    )
    
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error scraping {url}: {response.status_code}")
        return {}

def compute_content_hash(content: str) -> str:
    """Hash article content for deduplication."""
    return hashlib.sha256(content.encode()).hexdigest()

def deduplicate_articles(articles: list[dict], seen_hashes: set[str]) -> list[dict]:
    """Filter out duplicate articles based on content hash."""
    unique_articles = []
    
    for article in articles:
        content = article.get("markdown", "")
        content_hash = compute_content_hash(content)
        
        if content_hash not in seen_hashes:
            unique_articles.append(article)
            seen_hashes.add(content_hash)
    
    return unique_articles

def aggregate_news(source_domains: list[str]) -> dict:
    """Main pipeline: discover URLs, scrape articles, deduplicate, extract metadata."""
    all_articles = []
    seen_hashes = set()
    
    extraction_schema = {
        "type": "object",
        "properties": {
            "headline": {"type": "string", "description": "Article headline"},
            "author": {"type": "string", "description": "Author name"},
            "published_date": {"type": "string", "description": "Publication date (ISO 8601)"},
            "summary": {"type": "string", "description": "One-sentence summary"},
            "category": {"type": "string", "description": "News category (Tech, Finance, World, etc.)"}
        },
        "required": ["headline", "published_date"]
    }
    
    # Phase 1: Discover URLs
    print("Phase 1: Discovering article URLs...")
    all_urls = []
    for domain in source_domains:
        print(f"  Mapping {domain}...")
        urls = map_news_domain(domain)
        all_urls.extend(urls)
    
    print(f"Found {len(all_urls)} article URLs")
    
    # Phase 2: Scrape articles
    print("\nPhase 2: Scraping articles...")
    urls_to_scrape = all_urls[:50]  # Limit to 50 for demo
    for i, url in enumerate(urls_to_scrape):
        print(f"  [{i+1}/{len(urls_to_scrape)}] Scraping {url[:60]}...")
        article = scrape_article(url, extraction_schema)
        
        if article:
            all_articles.append({
                "url": url,
                "scraped_at": datetime.utcnow().isoformat(),
                **article
            })
    
    print(f"Scraped {len(all_articles)} articles")
    
    # Phase 3: Deduplicate
    print("\nPhase 3: Deduplicating content...")
    unique_articles = deduplicate_articles(all_articles, seen_hashes)
    
    print(f"After deduplication: {len(unique_articles)} unique articles")
    
    # Phase 4: Return summary
    return {
        "total_urls_discovered": len(all_urls),
        "articles_scraped": len(all_articles),
        "unique_articles": len(unique_articles),
        "articles": unique_articles
    }

# Example usage
if __name__ == "__main__":
    news_sources = [
        "https://www.bbc.com/news",
        "https://www.reuters.com",
        "https://techcrunch.com",
        "https://www.theverge.com",
        "https://www.cnbc.com"
    ]
    
    result = aggregate_news(news_sources)
    
    print("\n=== AGGREGATION SUMMARY ===")
    print(f"Total URLs: {result['total_urls_discovered']}")
    print(f"Articles scraped: {result['articles_scraped']}")
    print(f"Unique articles: {result['unique_articles']}")
    
    # Display first article
    if result['articles']:
        print("\n=== FIRST ARTICLE ===")
        article = result['articles'][0]
        extraction = article.get('extraction', {})
        print(f"Headline: {extraction.get('headline', 'N/A')}")
        print(f"Author: {extraction.get('author', 'N/A')}")
        print(f"Published: {extraction.get('published_date', 'N/A')}")
        print(f"Summary: {extraction.get('summary', 'N/A')}")
        print(f"Source: {article['url']}")

Production Considerations

Scaling to thousands of articles:

  • Use a task queue (Celery, Bull, or fastCRW's built-in scheduling) to parallelize scrapes across source domains
  • Implement incremental updates: only re-scrape articles changed in the last 24 hours
  • Cache URL lists for 7 days before re-mapping domains (most news sitemaps are stable)
  • Set up monitoring to alert when scrape success rate drops below 95%

Handling site changes and blocking:

  • Implement exponential backoff for failed scrapes; some sites temporarily block aggressive crawlers (see the sketch after this list)
  • Monitor HTTP status codes; 429 (rate limit) and 403 (forbidden) indicate need for delays or proxy rotation
  • Use fastCRW's stealth mode to avoid detection (set appropriate user-agent and delay between requests)
  • For premium or paywall-protected content, verify compliance with each site's Terms of Service
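
A backoff wrapper around the scrape call might look like this sketch; the retry count and delays are illustrative, and whether to retry 403s at all depends on the source:

import time
import requests

def scrape_with_backoff(url: str, payload: dict, headers: dict,
                        max_retries: int = 5) -> dict:
    """Retry transient failures (429/403/5xx) with exponential backoff."""
    delay = 1.0
    for attempt in range(max_retries):
        response = requests.post(url, json=payload, headers=headers)
        if response.status_code == 200:
            return response.json()
        if response.status_code in (429, 403) or response.status_code >= 500:
            time.sleep(delay)
            delay *= 2  # 1s, 2s, 4s, 8s, 16s
        else:
            break  # non-retryable client error
    return {}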

Storage and indexing:

  • Store articles with full metadata (source, scraped_at, extracted headline, author, etc.) for filtering and sorting
  • Index article content with full-text search (PostgreSQL, Elasticsearch) for quick keyword lookups
  • Implement TTL on archived articles (e.g., delete articles older than 30 days) to control database size
  • Use Redis for caching popular queries and recently published articles

Feed quality and ranking:

  • Implement duplicate detection not just at the content level, but also by semantic similarity (TF-IDF, embeddings); see the sketch after this list
  • Rank articles by recency, engagement (clicks/shares), source reputation, and topic relevance
  • Surface breaking news by detecting rapid publish spikes in specific categories
  • Allow users to filter by source, category, and date range for personalization
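
For the semantic-similarity pass, a TF-IDF baseline with scikit-learn is a reasonable starting point before investing in embeddings. A sketch, with the 0.85 similarity threshold as a tunable assumption:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicates(texts: list[str], threshold: float = 0.85) -> set[int]:
    """Return indexes of articles that are near-duplicates of an
    earlier article, judged by TF-IDF cosine similarity."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(matrix)
    dupes = set()
    for i in range(len(texts)):
        for j in range(i):
            if j not in dupes and sims[i, j] >= threshold:
                dupes.add(i)
                break
    return dupes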

Compliance and ethical scraping:

  • Respect robots.txt and Terms of Service for each news domain
  • Include proper attribution and source links in every feed entry
  • For paywalled content, link to the original rather than republishing full text
  • Implement user-agent headers and appropriate delays to avoid disrupting source sites
  • Monitor for DMCA takedown notices or complaints from publishers

Pricing Math: News Aggregation at Scale

Assume you want to aggregate 50 news sources daily, scraping ~2 new articles per source = 100 articles/day.

Breakdown:

  • URL mapping: 50 domains × 1 map/week × 50 credits = 2,500 credits/week ≈ 360 credits/day
  • Article scraping: 100 articles/day × 10 credits per scrape = 1,000 credits/day
  • Metadata extraction: 100 articles × 5 credits (LLM) = 500 credits/day
  • Total: ~1,860 credits/day ≈ 56,000 credits/month
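
The same arithmetic as a quick script, using the example credit rates from the breakdown above:

SOURCES = 50
ARTICLES_PER_SOURCE_PER_DAY = 2
MAP_CREDITS, SCRAPE_CREDITS, EXTRACT_CREDITS = 50, 10, 5  # example rates

mapping_per_day = SOURCES * MAP_CREDITS / 7  # weekly maps, amortized daily
scraping_per_day = SOURCES * ARTICLES_PER_SOURCE_PER_DAY * SCRAPE_CREDITS
extraction_per_day = SOURCES * ARTICLES_PER_SOURCE_PER_DAY * EXTRACT_CREDITS
total_per_day = mapping_per_day + scraping_per_day + extraction_per_day

print(f"~{total_per_day:.0f} credits/day, ~{total_per_day * 30:.0f} credits/month")
# ~1857 credits/day, ~55714 credits/month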

Plan options:

  • Pro plan ($13/mo, 10,000 credits): covers ~9 sources at daily refresh (~1,100 credits per source per month). Enough for an internal tool or a specialized niche.
  • Business plan ($49/mo, 50,000 credits): covers ~45 sources at daily refresh, or the full 50 sources at a slightly reduced cadence. Good for public aggregation platforms.
  • Enterprise (custom pricing): for 1,000+ articles/day or real-time feeds with hourly scrapes of major outlets.

Cost optimization:

  • Prioritize high-traffic sources with hourly scrapes; scrape niche sources 2–3 times weekly
  • Cache URL lists for 7 days instead of re-mapping daily (cuts mapping cost from 50 to ~7 credits/domain/day)
  • Use HTTP-only rendering (cheaper) for news sites; reserve Chrome rendering for complex layouts
  • Deduplicate aggressively to avoid re-extracting duplicate content

FAQ

Q: How do I keep my aggregation feed fresh without scraping every source constantly?

A: Implement tiered refresh rates. Scrape breaking-news sources (Reuters, AP, BBC) every 15–30 minutes. Mid-tier sources every 2–4 hours. Archive and niche sources daily or weekly. Prioritize by traffic and publish frequency.

Q: Can I legally republish news content I scrape?

A: It depends on what you republish. Aggregating headlines and summaries with attribution and links back to the source (the Google News model) is generally tolerated and may qualify as fair use. Republishing full text without permission risks DMCA takedowns. Always check robots.txt and each site's Terms of Service; many news sites explicitly permit aggregation with attribution.

Q: How do I detect breaking news automatically?

A: Monitor article publish timestamps and headline patterns. If 5+ sources publish similar headlines within 30 minutes, or if article count spikes unexpectedly, it's likely breaking news. Implement email/SMS alerts for specific keywords (company names, event types) to notify users immediately.

Q: Should I use RSS feeds or scraping?

A: Use both. RSS is fast and cheap for sources that maintain it (Reuters, BBC, NY Times offer RSS). Scrape sources that lack RSS or where RSS summaries are insufficient. fastCRW covers the gaps that feeds leave behind.

Q: How do I handle sites with complex JavaScript rendering?

A: Use fastCRW's LightPanda or Chrome rendering modes. They're more expensive per scrape (~$0.001–0.005 extra) but ensure you capture dynamically loaded content. For news sites, HTTP scraping usually suffices; reserve JS rendering for tech sites with heavy client-side rendering.

Q: How long does it take to scrape 1,000 articles?

A: At ~2 seconds per article average, 1,000 articles takes ~33 minutes with serial scraping. With fastCRW's batch endpoint or parallel requests (10 concurrent), you can do it in 3–5 minutes. Use /v1/batch/scrape when available, or parallelize manually via your task queue.

Q: Can I get images and videos from articles?

A: fastCRW returns images in markdown format (embedded links). For video metadata, use extraction to pull video URLs from the page. Store image/video URLs in your article record; serve them via CDN to users.

Q: What's the difference between fastCRW news aggregation and Google News?

A: Google News is a curated, ranked feed with editorial features (local coverage, personalization, trending topics). fastCRW is a raw scraping/API layer that lets you build your own aggregation product. You control sources, ranking logic, and feature set.
