Web Scraping for News Aggregation
Build a news aggregation pipeline with fastCRW: discover URLs across news sites, scrape full articles, deduplicate content, and summarize with LLM extraction.
fastCRW enables cost-effective, real-time news aggregation at scale. Combine scheduled scrapes with deduplication and LLM summarization to build a comprehensive news platform without maintaining complex crawler infrastructure.
Why News Aggregation Needs Web Scraping
The news industry evolves faster than manual curation can handle. Hundreds of sources publish thousands of articles daily. RSS feeds, when they exist, often contain only headlines and summaries. Paywalled and premium sites strip content from their feeds entirely.
Automated scraping enables:
- Real-time coverage across dozens of sources without manual work
- Full-text extraction including body content, author, publication date, and images
- Consistent formatting across diverse source sites (all converted to clean markdown)
- Deduplication to surface unique stories, not duplicates across feeds
- Structured metadata automatically extracted for search, filtering, and ranking
Without scraping, you're limited to RSS feeds (incomplete and inconsistent) or manually building integrations per source (not scalable).
Where fastCRW Helps
| Aggregation need | fastCRW role |
|---|---|
| URL discovery | /v1/map finds all article URLs on news domains using sitemaps and crawl patterns |
| Full-text extraction | /v1/scrape returns clean markdown of articles with all metadata |
| Scheduled collection | Schedule scrapes on a recurring basis to catch new articles daily or hourly |
| Deduplication | fastCRW's consistent output format makes it easy to detect duplicate content via hashing |
| Extraction & summarization | Use LLM extraction to pull structured fields (headline, summary, author) automatically |
| Search integration | /v1/search finds news about specific topics across the web, then scrapes top results |
Typical News Aggregation Flow
- Identify target sources: Choose 10–50 news domains relevant to your audience (technology, finance, sports, world news, etc.)
- Discover URLs: Use /v1/map on each domain to extract all article URLs. Most news sites have sitemaps or predictable URL patterns.
- Schedule scrapes: Run /v1/scrape on discovered URLs daily (or more frequently for breaking news) to extract full article text.
- Clean and deduplicate: Compare article content via hash or semantic similarity. Remove duplicates that appear across multiple sources.
- Extract metadata: Use LLM extraction to automatically fill headline, author, publish date, and summary fields.
- Store and index: Insert cleaned articles into your database with source attribution, timestamps, and topic tags.
- Serve to users: Build a feed UI that surfaces new stories, trending topics, and personalized recommendations.
- Monitor and alert: Detect keyword mentions across sources and trigger alerts for breaking news or trending topics.
Good Fits for News Aggregation
- News platforms aggregating stories across 20+ sources for a unified feed
- Industry monitoring dashboards tracking sector-specific news (finance, tech, healthcare)
- Research teams building topic-specific corpora for analysis and trend detection
- Alert services that notify users when specific keywords appear in the news
- AI-powered news assistants that summarize trending topics or specific industries
- Social media managers who need to identify trending news for commentary or engagement
- Competitive intelligence teams monitoring competitor mentions and industry developments
Architecture: Building a Scalable News Aggregation Pipeline
A production news aggregation pipeline has several layers:
1. URL Discovery Layer
Use /v1/map to crawl each news domain's sitemap and discover all article URLs. Most news sites update sitemaps automatically, so you can re-run discovery weekly to catch new sections or archived content.
2. Scraping Layer
Schedule /v1/scrape requests on discovered URLs using your application server or a job queue. Prioritize high-traffic sources (BBC, Reuters, NY Times) for frequent scrapes, and scrape niche sources less often. Use fastCRW's scheduling features to distribute load.
3. Deduplication Layer
After scraping, compute content hashes of article text. Store hashes in your database; if a new scrape returns a known hash, skip processing. This prevents the same story (published by multiple outlets) from cluttering your feed.
4. Extraction Layer
Use LLM extraction (Claude via fastCRW) to pull structured data: headline, author, publish date, summary, category tags. This ensures consistent metadata across sources with different HTML structures.
5. Storage & Search Layer
Insert deduplicated, extracted articles into your database (PostgreSQL, MongoDB) with full-text search indexing. Tag articles by source, topic, and keywords.
6. Feed & API Layer
Serve articles via REST or GraphQL API. Implement pagination, filtering by source/topic/date, and sorting by recency or trending score.
7. Alerting & Monitoring
Set up webhooks or cron jobs to detect breaking news (rapid spikes in article counts), keyword mentions, or sentiment shifts. Send notifications to users.
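The alerting layer can be as simple as counting recent articles per category and comparing against a threshold. A minimal sketch (the 30-minute window and five-article threshold are illustrative assumptions, and articles are assumed to carry an ISO 8601 `published_date`):

```python
from collections import Counter
from datetime import datetime, timedelta

def detect_spikes(articles: list[dict], window_minutes: int = 30,
                  threshold: int = 5) -> list[str]:
    """Return categories with >= threshold articles published in the window."""
    cutoff = datetime.utcnow() - timedelta(minutes=window_minutes)
    recent = [a for a in articles
              if datetime.fromisoformat(a["published_date"]) >= cutoff]
    counts = Counter(a.get("category", "Uncategorized") for a in recent)
    return [cat for cat, n in counts.items() if n >= threshold]
```

Run this on a cron schedule and fan the result out to whatever notification channel you use.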
Implementation Walkthrough: News Aggregation Pipeline
Here's a working Python implementation that scrapes news articles, deduplicates them, and extracts metadata:
```python
import requests
import hashlib
from datetime import datetime
from typing import Optional

# fastCRW API configuration
CRW_API_KEY = "your-api-key"
CRW_BASE_URL = "https://api.fastcrw.com/v1"


def map_news_domain(domain: str) -> list[str]:
    """Discover all article URLs on a news domain using /v1/map."""
    payload = {
        "url": domain,
        "useSitemap": True,
        "maxDepth": 2
    }
    response = requests.post(
        f"{CRW_BASE_URL}/map",
        json=payload,
        headers={"Authorization": f"Bearer {CRW_API_KEY}"}
    )
    if response.status_code == 200:
        return response.json().get("urls", [])
    print(f"Error mapping {domain}: {response.status_code}")
    return []


def scrape_article(url: str, extraction_schema: Optional[dict] = None) -> dict:
    """Scrape a news article and extract metadata using /v1/scrape."""
    payload = {
        "url": url,
        "format": "markdown",
        "extraction": {
            "schema": extraction_schema or {
                "type": "object",
                "properties": {
                    "headline": {"type": "string"},
                    "author": {"type": "string"},
                    "published_date": {"type": "string"},
                    "summary": {"type": "string"},
                    "category": {"type": "string"}
                }
            }
        }
    }
    response = requests.post(
        f"{CRW_BASE_URL}/scrape",
        json=payload,
        headers={"Authorization": f"Bearer {CRW_API_KEY}"}
    )
    if response.status_code == 200:
        return response.json()
    print(f"Error scraping {url}: {response.status_code}")
    return {}


def compute_content_hash(content: str) -> str:
    """Hash article content for deduplication."""
    return hashlib.sha256(content.encode()).hexdigest()


def deduplicate_articles(articles: list[dict], seen_hashes: set[str]) -> list[dict]:
    """Filter out duplicate articles based on content hash."""
    unique_articles = []
    for article in articles:
        content_hash = compute_content_hash(article.get("markdown", ""))
        if content_hash not in seen_hashes:
            unique_articles.append(article)
            seen_hashes.add(content_hash)
    return unique_articles


def aggregate_news(source_domains: list[str]) -> dict:
    """Main pipeline: discover URLs, scrape articles, deduplicate, extract metadata."""
    all_articles = []
    seen_hashes: set[str] = set()
    extraction_schema = {
        "type": "object",
        "properties": {
            "headline": {"type": "string", "description": "Article headline"},
            "author": {"type": "string", "description": "Author name"},
            "published_date": {"type": "string", "description": "Publication date (ISO 8601)"},
            "summary": {"type": "string", "description": "One-sentence summary"},
            "category": {"type": "string", "description": "News category (Tech, Finance, World, etc.)"}
        },
        "required": ["headline", "published_date"]
    }

    # Phase 1: Discover URLs
    print("Phase 1: Discovering article URLs...")
    all_urls = []
    for domain in source_domains:
        print(f"  Mapping {domain}...")
        all_urls.extend(map_news_domain(domain))
    print(f"Found {len(all_urls)} article URLs")

    # Phase 2: Scrape articles
    print("\nPhase 2: Scraping articles...")
    demo_urls = all_urls[:50]  # Limit to 50 for demo
    for i, url in enumerate(demo_urls):
        print(f"  [{i + 1}/{len(demo_urls)}] Scraping {url[:60]}...")
        article = scrape_article(url, extraction_schema)
        if article:
            all_articles.append({
                "url": url,
                "scraped_at": datetime.utcnow().isoformat(),
                **article
            })
    print(f"Scraped {len(all_articles)} articles")

    # Phase 3: Deduplicate
    print("\nPhase 3: Deduplicating content...")
    unique_articles = deduplicate_articles(all_articles, seen_hashes)
    print(f"After deduplication: {len(unique_articles)} unique articles")

    # Phase 4: Return summary
    return {
        "total_urls_discovered": len(all_urls),
        "articles_scraped": len(all_articles),
        "unique_articles": len(unique_articles),
        "articles": unique_articles
    }


# Example usage
if __name__ == "__main__":
    news_sources = [
        "https://www.bbc.com/news",
        "https://www.reuters.com",
        "https://techcrunch.com",
        "https://www.theverge.com",
        "https://www.cnbc.com"
    ]
    result = aggregate_news(news_sources)

    print("\n=== AGGREGATION SUMMARY ===")
    print(f"Total URLs: {result['total_urls_discovered']}")
    print(f"Articles scraped: {result['articles_scraped']}")
    print(f"Unique articles: {result['unique_articles']}")

    # Display first article
    if result["articles"]:
        article = result["articles"][0]
        extraction = article.get("extraction", {})
        print("\n=== FIRST ARTICLE ===")
        print(f"Headline: {extraction.get('headline', 'N/A')}")
        print(f"Author: {extraction.get('author', 'N/A')}")
        print(f"Published: {extraction.get('published_date', 'N/A')}")
        print(f"Summary: {extraction.get('summary', 'N/A')}")
        print(f"Source: {article['url']}")
```
Production Considerations
Scaling to thousands of articles:
- Use a task queue (Celery, Bull, or fastCRW's built-in scheduling) to parallelize scrapes across source domains
- Implement incremental updates: only re-scrape articles changed in the last 24 hours
- Cache URL lists for 7 days before re-mapping domains (most news sitemaps are stable)
- Set up monitoring to alert when scrape success rate drops below 95%
Handling site changes and blocking:
- Implement exponential backoff for failed scrapes (some sites temporarily block aggressive crawlers)
- Monitor HTTP status codes; 429 (rate limit) and 403 (forbidden) indicate need for delays or proxy rotation
- Use fastCRW's stealth mode to avoid detection (set appropriate user-agent and delay between requests)
- For premium or paywall-protected content, verify compliance with each site's Terms of Service
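Backoff itself takes only a few lines: double the delay after each retriable failure and give up after a fixed number of attempts. In this sketch, the endpoint URL and the choice of 429/403 as retriable statuses are assumptions consistent with the bullets above:

```python
import time
import requests

def backoff_delays(max_retries: int = 4, base: float = 2.0) -> list[float]:
    """Delays in seconds, doubling each retry: 2, 4, 8, ..."""
    return [base * (2 ** i) for i in range(max_retries)]

def scrape_with_backoff(url: str, api_key: str, max_retries: int = 4):
    """Retry a scrape, sleeping longer after each 429/403 or network error."""
    for delay in backoff_delays(max_retries):
        try:
            resp = requests.post(
                "https://api.fastcrw.com/v1/scrape",
                json={"url": url, "format": "markdown"},
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=30,
            )
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code not in (429, 403):
                return None  # permanent failure; don't retry
        except requests.RequestException:
            pass  # transient network error; retry
        time.sleep(delay)
    return None
```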
Storage and indexing:
- Store articles with full metadata (source, scraped_at, extracted headline, author, etc.) for filtering and sorting
- Index article content with full-text search (PostgreSQL, Elasticsearch) for quick keyword lookups
- Implement TTL on archived articles (e.g., delete articles older than 30 days) to control database size
- Use Redis for caching popular queries and recently published articles
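The TTL purge can be a single scheduled DELETE. A sketch using sqlite3 as a stand-in for your production database (the `articles` table layout is hypothetical):

```python
import sqlite3
from datetime import datetime, timedelta

def purge_old_articles(conn: sqlite3.Connection, ttl_days: int = 30) -> int:
    """Delete articles scraped more than ttl_days ago; return rows removed.

    Relies on scraped_at being stored as an ISO 8601 string, which sorts
    lexicographically in timestamp order.
    """
    cutoff = (datetime.utcnow() - timedelta(days=ttl_days)).isoformat()
    cur = conn.execute("DELETE FROM articles WHERE scraped_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```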
Feed quality and ranking:
- Implement duplicate detection not just at the content level, but also by semantic similarity (TF-IDF, embeddings)
- Rank articles by recency, engagement (clicks/shares), source reputation, and topic relevance
- Surface breaking news by detecting rapid publish spikes in specific categories
- Allow users to filter by source, category, and date range for personalization
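One simple ranking function blends exponential recency decay with source reputation and topic relevance. The weights and the six-hour half-life below are illustrative assumptions, not prescriptive values:

```python
import math
from datetime import datetime

def rank_score(published_iso: str, source_reputation: float,
               topic_relevance: float, half_life_hours: float = 6.0) -> float:
    """Blend recency decay with reputation and relevance (both scored 0-1)."""
    age_hours = (datetime.utcnow()
                 - datetime.fromisoformat(published_iso)).total_seconds() / 3600
    # Recency halves every half_life_hours; brand-new articles score ~1.0
    recency = math.exp(-math.log(2) * max(age_hours, 0.0) / half_life_hours)
    return 0.5 * recency + 0.3 * source_reputation + 0.2 * topic_relevance
```

Tune the weights against click-through data once you have engagement signals.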
Compliance and ethical scraping:
- Respect robots.txt and Terms of Service for each news domain
- Include proper attribution and source links in every feed entry
- For paywalled content, link to the original rather than republishing full text
- Implement user-agent headers and appropriate delays to avoid disrupting source sites
- Monitor for DMCA takedown notices or complaints from publishers
Pricing Math: News Aggregation at Scale
Assume you want to aggregate 50 news sources daily, scraping ~2 new articles per source = 100 articles/day.
Breakdown:
- URL mapping: 50 domains × 1 map/week × 50 credits = 2,500 credits/week ≈ 357 credits/day
- Article scraping: 100 articles/day × 10 credits per scrape = 1,000 credits/day
- Metadata extraction: 100 articles × 5 credits (LLM) = 500 credits/day
- Total: ~1,857 credits/day ≈ 55,700 credits/month
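Computed directly from the per-call rates in this section (50 credits per map, 10 per scrape, 5 per LLM extraction), the math works out like this:

```python
# Assumed per-call credit rates from the breakdown above
MAP_CREDITS = 50       # one /v1/map call
SCRAPE_CREDITS = 10    # one article scrape
EXTRACT_CREDITS = 5    # one LLM metadata extraction

def daily_credits(sources: int, articles_per_source: int = 2,
                  maps_per_week: int = 1) -> float:
    """Estimated credits per day for a given source count."""
    mapping = sources * maps_per_week * MAP_CREDITS / 7   # amortized over the week
    scraping = sources * articles_per_source * SCRAPE_CREDITS
    extraction = sources * articles_per_source * EXTRACT_CREDITS
    return mapping + scraping + extraction
```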
Plan options:
- Pro plan ($13/mo, 10,000 credits): Covers ~8 sources at daily refresh. Enough for an internal tool or a specialized niche.
- Business plan ($49/mo, 50,000 credits): Covers ~44 sources at daily refresh, or all 50 at a slightly reduced cadence. Good for public aggregation platforms.
- Enterprise (custom pricing): For 1,000+ articles/day or real-time feeds with hourly scrapes on major outlets.
Cost optimization:
- Prioritize high-traffic sources with hourly scrapes; scrape niche sources 2–3 times weekly
- Cache URL lists for 7 days instead of re-mapping daily (saves ~43 credits/domain/day)
- Use HTTP-only rendering (cheaper) for news sites; reserve Chrome rendering for complex layouts
- Deduplicate aggressively to avoid re-extracting duplicate content
FAQ
Q: How do I keep my aggregation feed fresh without scraping every source constantly?
A: Implement tiered refresh rates. Scrape breaking-news sources (Reuters, AP, BBC) every 15–30 minutes. Mid-tier sources every 2–4 hours. Archive and niche sources daily or weekly. Prioritize by traffic and publish frequency.
Q: Can I legally republish news content I scrape?
A: It depends. If you're aggregating and attributing with links (like Google News), most publishers permit it under fair use. If you're republishing full text without permission, you risk DMCA takedowns. Always check robots.txt and Terms of Service. Many news sites explicitly permit aggregation with attribution.
Q: How do I detect breaking news automatically?
A: Monitor article publish timestamps and headline patterns. If 5+ sources publish similar headlines within 30 minutes, or if article count spikes unexpectedly, it's likely breaking news. Implement email/SMS alerts for specific keywords (company names, event types) to notify users immediately.
Q: Should I use RSS feeds or scraping?
A: Use both. RSS is fast and cheap for sources that maintain it (Reuters, BBC, NY Times offer RSS). Scrape sources that lack RSS or where RSS summaries are insufficient. fastCRW covers the gaps that feeds leave behind.
Q: How do I handle sites with complex JavaScript rendering?
A: Use fastCRW's LightPanda or Chrome rendering modes. They're more expensive per scrape (~$0.001–0.005 extra) but ensure you capture dynamically loaded content. For news sites, HTTP scraping usually suffices; reserve JS rendering for tech sites with heavy client-side rendering.
Q: How long does it take to scrape 1,000 articles?
A: At ~2 seconds per article average, 1,000 articles takes ~33 minutes with serial scraping. With fastCRW's batch endpoint or parallel requests (10 concurrent), you can do it in 3–5 minutes. Use /v1/batch/scrape when available, or parallelize manually via your task queue.
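Parallelizing manually is a few lines with a thread pool; here `scrape_fn` stands in for whatever per-URL scrape call you use:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls: list[str], scrape_fn, max_workers: int = 10) -> list[dict]:
    """Run scrape_fn over urls concurrently, dropping failures (None results).

    Results come back in input order because Executor.map preserves ordering.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(scrape_fn, urls))
    return [r for r in results if r]
```

Threads work well here because the workload is I/O-bound (waiting on HTTP responses), so the GIL is not a bottleneck.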
Q: Can I get images and videos from articles?
A: fastCRW returns images in markdown format (embedded links). For video metadata, use extraction to pull video URLs from the page. Store image/video URLs in your article record; serve them via CDN to users.
Q: What's the difference between fastCRW news aggregation and Google News?
A: Google News is a curated, ranked feed with editorial features (local coverage, personalization, trending topics). fastCRW is a raw scraping/API layer that lets you build your own aggregation product. You control sources, ranking logic, and feature set.
Related resources
- Firecrawl alternatives — the managed scraping API many news pipelines start on before scaling self-host
- Jina Reader alternatives — markdown-extraction comparison for article body cleanup
- LangChain integration — chain scrape + summarize for downstream news summarization
- Langflow integration — same flow, visual editor
- Content aggregation — the broader pattern for any high-volume site corpus
- Brand monitoring — filter the same article stream for your brand mentions
- RAG pipelines — pipe articles into a vector store for retrieval-augmented chat