Web Scraping for News Aggregation
Build a news aggregation pipeline with fastCRW: discover URLs across news sites, scrape full articles, deduplicate content, and summarize with LLM extraction.
fastCRW enables cost-effective, real-time news aggregation at scale. Combine scheduled scrapes with deduplication and LLM summarization to build a comprehensive news platform without maintaining complex crawler infrastructure.
Why News Aggregation Needs Web Scraping
The news industry evolves faster than manual curation can handle. Hundreds of sources publish thousands of articles daily. RSS feeds, when they exist, often contain only headlines and summaries. Paywalled and premium sites strip content from their feeds entirely.
Automated scraping enables:
- Real-time coverage across dozens of sources without manual work
- Full-text extraction including body content, author, publication date, and images
- Consistent formatting across diverse source sites (all converted to clean markdown)
- Deduplication to surface unique stories, not duplicates across feeds
- Structured metadata automatically extracted for search, filtering, and ranking
Without scraping, you're limited to RSS feeds (incomplete and inconsistent) or manually building integrations per source (not scalable).
Where fastCRW Helps
| Aggregation need | fastCRW role |
|---|---|
| URL discovery | /v1/map finds all article URLs on news domains using sitemaps and crawl patterns |
| Full-text extraction | /v1/scrape returns clean markdown of articles with all metadata |
| Scheduled collection | Schedule scrapes on a recurring basis to catch new articles daily or hourly |
| Deduplication | fastCRW's consistent output format makes it easy to detect duplicate content via hashing |
| Extraction & summarization | Use LLM extraction to pull structured fields (headline, summary, author) automatically |
| Search integration | /v1/search finds news about specific topics across the web, then scrapes top results |
Typical News Aggregation Flow
- Identify target sources: Choose 10–50 news domains relevant to your audience (technology, finance, sports, world news, etc.)
- Discover URLs: Use /v1/map on each domain to extract all article URLs. Most news sites have sitemaps or predictable URL patterns.
- Schedule scrapes: Run /v1/scrape on discovered URLs daily (or more frequently for breaking news) to extract full article text.
- Clean and deduplicate: Compare article content via hash or semantic similarity. Remove duplicates that appear across multiple sources.
- Extract metadata: Use LLM extraction to automatically fill headline, author, publish date, and summary fields.
- Store and index: Insert cleaned articles into your database with source attribution, timestamps, and topic tags.
- Serve to users: Build a feed UI that surfaces new stories, trending topics, and personalized recommendations.
- Monitor and alert: Detect keyword mentions across sources and trigger alerts for breaking news or trending topics.
Good Fits for News Aggregation
- News platforms aggregating stories across 20+ sources for a unified feed
- Industry monitoring dashboards tracking sector-specific news (finance, tech, healthcare)
- Research teams building topic-specific corpora for analysis and trend detection
- Alert services that notify users when specific keywords appear in the news
- AI-powered news assistants that summarize trending topics or specific industries
- Social media managers who need to identify trending news for commentary or engagement
- Competitive intelligence teams monitoring competitor mentions and industry developments
Architecture: Building a Scalable News Aggregation Pipeline
A production news aggregation pipeline has several layers:
1. URL Discovery Layer
Use /v1/map to crawl each news domain's sitemap and discover all article URLs. Most news sites update sitemaps automatically, so you can re-run discovery weekly to catch new sections or archived content.
2. Scraping Layer
Schedule /v1/scrape requests on discovered URLs using your application server or a job queue. Prioritize high-traffic sources (BBC, Reuters, NY Times) for frequent scrapes, and scrape niche sources less often. Use fastCRW's scheduling features to distribute load.
3. Deduplication Layer
After scraping, compute content hashes of article text. Store hashes in your database; if a new scrape returns a known hash, skip processing. This prevents the same story (published by multiple outlets) from cluttering your feed.
4. Extraction Layer
Use LLM extraction (Claude via fastCRW) to pull structured data: headline, author, publish date, summary, category tags. This ensures consistent metadata across sources with different HTML structures.
5. Storage & Search Layer
Insert deduplicated, extracted articles into your database (PostgreSQL, MongoDB) with full-text search indexing. Tag articles by source, topic, and keywords.
6. Feed & API Layer
Serve articles via REST or GraphQL API. Implement pagination, filtering by source/topic/date, and sorting by recency or trending score.
7. Alerting & Monitoring
Set up webhooks or cron jobs to detect breaking news (rapid spikes in article counts), keyword mentions, or sentiment shifts. Send notifications to users.
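The alerting layer can be as simple as counting recent articles per category and comparing against a threshold. A minimal sketch (the 30-minute window and five-article threshold are illustrative assumptions, and articles are assumed to carry an ISO 8601 `published_date`):

```python
from collections import Counter
from datetime import datetime, timedelta

def detect_spikes(articles: list[dict], window_minutes: int = 30,
                  threshold: int = 5) -> list[str]:
    """Return categories with >= threshold articles published in the window."""
    cutoff = datetime.utcnow() - timedelta(minutes=window_minutes)
    recent = [a for a in articles
              if datetime.fromisoformat(a["published_date"]) >= cutoff]
    counts = Counter(a.get("category", "Uncategorized") for a in recent)
    return [cat for cat, n in counts.items() if n >= threshold]
```

Run this on a cron schedule and fan the result out to whatever notification channel you use.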
Implementation Walkthrough: News Aggregation Pipeline
Here's a working Python implementation that scrapes news articles, deduplicates them, and extracts metadata:
```python
import requests
import hashlib
from datetime import datetime
from typing import Optional

# fastCRW API configuration
CRW_API_KEY = "your-api-key"
CRW_BASE_URL = "https://api.fastcrw.com/v1"


def map_news_domain(domain: str) -> list[str]:
    """Discover all article URLs on a news domain using /v1/map."""
    payload = {
        "url": domain,
        "useSitemap": True,
        "maxDepth": 2
    }
    response = requests.post(
        f"{CRW_BASE_URL}/map",
        json=payload,
        headers={"Authorization": f"Bearer {CRW_API_KEY}"}
    )
    if response.status_code == 200:
        return response.json().get("urls", [])
    print(f"Error mapping {domain}: {response.status_code}")
    return []


def scrape_article(url: str, extraction_schema: Optional[dict] = None) -> dict:
    """Scrape a news article and extract metadata using /v1/scrape."""
    payload = {
        "url": url,
        "format": "markdown",
        "extraction": {
            "schema": extraction_schema or {
                "type": "object",
                "properties": {
                    "headline": {"type": "string"},
                    "author": {"type": "string"},
                    "published_date": {"type": "string"},
                    "summary": {"type": "string"},
                    "category": {"type": "string"}
                }
            }
        }
    }
    response = requests.post(
        f"{CRW_BASE_URL}/scrape",
        json=payload,
        headers={"Authorization": f"Bearer {CRW_API_KEY}"}
    )
    if response.status_code == 200:
        return response.json()
    print(f"Error scraping {url}: {response.status_code}")
    return {}


def compute_content_hash(content: str) -> str:
    """Hash article content for deduplication."""
    return hashlib.sha256(content.encode()).hexdigest()


def deduplicate_articles(articles: list[dict], seen_hashes: set[str]) -> list[dict]:
    """Filter out duplicate articles based on content hash."""
    unique_articles = []
    for article in articles:
        content_hash = compute_content_hash(article.get("markdown", ""))
        if content_hash not in seen_hashes:
            unique_articles.append(article)
            seen_hashes.add(content_hash)
    return unique_articles


def aggregate_news(source_domains: list[str]) -> dict:
    """Main pipeline: discover URLs, scrape articles, deduplicate, extract metadata."""
    all_articles = []
    seen_hashes: set[str] = set()
    extraction_schema = {
        "type": "object",
        "properties": {
            "headline": {"type": "string", "description": "Article headline"},
            "author": {"type": "string", "description": "Author name"},
            "published_date": {"type": "string", "description": "Publication date (ISO 8601)"},
            "summary": {"type": "string", "description": "One-sentence summary"},
            "category": {"type": "string", "description": "News category (Tech, Finance, World, etc.)"}
        },
        "required": ["headline", "published_date"]
    }

    # Phase 1: Discover URLs
    print("Phase 1: Discovering article URLs...")
    all_urls = []
    for domain in source_domains:
        print(f"  Mapping {domain}...")
        all_urls.extend(map_news_domain(domain))
    print(f"Found {len(all_urls)} article URLs")

    # Phase 2: Scrape articles
    print("\nPhase 2: Scraping articles...")
    demo_urls = all_urls[:50]  # Limit to 50 for demo
    for i, url in enumerate(demo_urls):
        print(f"  [{i + 1}/{len(demo_urls)}] Scraping {url[:60]}...")
        article = scrape_article(url, extraction_schema)
        if article:
            all_articles.append({
                "url": url,
                "scraped_at": datetime.utcnow().isoformat(),
                **article
            })
    print(f"Scraped {len(all_articles)} articles")

    # Phase 3: Deduplicate
    print("\nPhase 3: Deduplicating content...")
    unique_articles = deduplicate_articles(all_articles, seen_hashes)
    print(f"After deduplication: {len(unique_articles)} unique articles")

    # Phase 4: Return summary
    return {
        "total_urls_discovered": len(all_urls),
        "articles_scraped": len(all_articles),
        "unique_articles": len(unique_articles),
        "articles": unique_articles
    }


# Example usage
if __name__ == "__main__":
    news_sources = [
        "https://www.bbc.com/news",
        "https://www.reuters.com",
        "https://techcrunch.com",
        "https://www.theverge.com",
        "https://www.cnbc.com"
    ]
    result = aggregate_news(news_sources)

    print("\n=== AGGREGATION SUMMARY ===")
    print(f"Total URLs: {result['total_urls_discovered']}")
    print(f"Articles scraped: {result['articles_scraped']}")
    print(f"Unique articles: {result['unique_articles']}")

    # Display first article
    if result["articles"]:
        article = result["articles"][0]
        extraction = article.get("extraction", {})
        print("\n=== FIRST ARTICLE ===")
        print(f"Headline: {extraction.get('headline', 'N/A')}")
        print(f"Author: {extraction.get('author', 'N/A')}")
        print(f"Published: {extraction.get('published_date', 'N/A')}")
        print(f"Summary: {extraction.get('summary', 'N/A')}")
        print(f"Source: {article['url']}")
```
Production Considerations
Scaling to thousands of articles:
- Use a task queue (Celery, Bull, or fastCRW's built-in scheduling) to parallelize scrapes across source domains
- Implement incremental updates: only re-scrape articles changed in the last 24 hours
- Cache URL lists for 7 days before re-mapping domains (most news sitemaps are stable)
- Set up monitoring to alert when scrape success rate drops below 95%
Handling site changes and blocking:
- Implement exponential backoff for failed scrapes (some sites temporarily block aggressive crawlers)
- Monitor HTTP status codes; 429 (rate limit) and 403 (forbidden) indicate need for delays or proxy rotation
- Use fastCRW's stealth mode to avoid detection (set appropriate user-agent and delay between requests)
- For premium or paywall-protected content, verify compliance with each site's Terms of Service
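Backoff itself takes only a few lines: double the delay after each retriable failure and give up after a fixed number of attempts. In this sketch, the endpoint URL and the choice of 429/403 as retriable statuses are assumptions consistent with the bullets above:

```python
import time
import requests

def backoff_delays(max_retries: int = 4, base: float = 2.0) -> list[float]:
    """Delays in seconds, doubling each retry: 2, 4, 8, ..."""
    return [base * (2 ** i) for i in range(max_retries)]

def scrape_with_backoff(url: str, api_key: str, max_retries: int = 4):
    """Retry a scrape, sleeping longer after each 429/403 or network error."""
    for delay in backoff_delays(max_retries):
        try:
            resp = requests.post(
                "https://api.fastcrw.com/v1/scrape",
                json={"url": url, "format": "markdown"},
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=30,
            )
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code not in (429, 403):
                return None  # permanent failure; don't retry
        except requests.RequestException:
            pass  # transient network error; retry
        time.sleep(delay)
    return None
```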
Storage and indexing:
- Store articles with full metadata (source, scraped_at, extracted headline, author, etc.) for filtering and sorting
- Index article content with full-text search (PostgreSQL, Elasticsearch) for quick keyword lookups
- Implement TTL on archived articles (e.g., delete articles older than 30 days) to control database size
- Use Redis for caching popular queries and recently published articles
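The TTL purge can be a single scheduled DELETE. A sketch using sqlite3 as a stand-in for your production database (the `articles` table layout is hypothetical):

```python
import sqlite3
from datetime import datetime, timedelta

def purge_old_articles(conn: sqlite3.Connection, ttl_days: int = 30) -> int:
    """Delete articles scraped more than ttl_days ago; return rows removed.

    Relies on scraped_at being stored as an ISO 8601 string, which sorts
    lexicographically in timestamp order.
    """
    cutoff = (datetime.utcnow() - timedelta(days=ttl_days)).isoformat()
    cur = conn.execute("DELETE FROM articles WHERE scraped_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```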
Feed quality and ranking:
- Implement duplicate detection not just at the content level, but also by semantic similarity (TF-IDF, embeddings)
- Rank articles by recency, engagement (clicks/shares), source reputation, and topic relevance
- Surface breaking news by detecting rapid publish spikes in specific categories
- Allow users to filter by source, category, and date range for personalization
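One simple ranking function blends exponential recency decay with source reputation and topic relevance. The weights and the six-hour half-life below are illustrative assumptions, not prescriptive values:

```python
import math
from datetime import datetime

def rank_score(published_iso: str, source_reputation: float,
               topic_relevance: float, half_life_hours: float = 6.0) -> float:
    """Blend recency decay with reputation and relevance (both scored 0-1)."""
    age_hours = (datetime.utcnow()
                 - datetime.fromisoformat(published_iso)).total_seconds() / 3600
    # Recency halves every half_life_hours; brand-new articles score ~1.0
    recency = math.exp(-math.log(2) * max(age_hours, 0.0) / half_life_hours)
    return 0.5 * recency + 0.3 * source_reputation + 0.2 * topic_relevance
```

Tune the weights against click-through data once you have engagement signals.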
Compliance and ethical scraping:
- Respect robots.txt and Terms of Service for each news domain
- Include proper attribution and source links in every feed entry
- For paywalled content, link to the original rather than republishing full text
- Implement user-agent headers and appropriate delays to avoid disrupting source sites
- Monitor for DMCA takedown notices or complaints from publishers
Pricing Math: News Aggregation at Scale
Assume you want to aggregate 50 news sources daily, scraping ~2 new articles per source = 100 articles/day.
Breakdown:
- URL mapping: 50 domains × 1 map/week × 50 credits = 2,500 credits/week ≈ 357 credits/day
- Article scraping: 100 articles/day × 10 credits per scrape = 1,000 credits/day
- Metadata extraction: 100 articles × 5 credits (LLM) = 500 credits/day
- Total: ~1,857 credits/day ≈ 55,700 credits/month
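Computed directly from the per-call rates in this section (50 credits per map, 10 per scrape, 5 per LLM extraction), the math works out like this:

```python
# Assumed per-call credit rates from the breakdown above
MAP_CREDITS = 50       # one /v1/map call
SCRAPE_CREDITS = 10    # one article scrape
EXTRACT_CREDITS = 5    # one LLM metadata extraction

def daily_credits(sources: int, articles_per_source: int = 2,
                  maps_per_week: int = 1) -> float:
    """Estimated credits per day for a given source count."""
    mapping = sources * maps_per_week * MAP_CREDITS / 7   # amortized over the week
    scraping = sources * articles_per_source * SCRAPE_CREDITS
    extraction = sources * articles_per_source * EXTRACT_CREDITS
    return mapping + scraping + extraction
```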
Plan options:
- Pro plan ($13/mo, 10,000 credits): Covers ~8 sources at daily refresh. Enough for an internal tool or a specialized niche.
- Business plan ($49/mo, 50,000 credits): Covers ~44 sources at daily refresh, or all 50 at a slightly reduced cadence. Good for public aggregation platforms.
- Enterprise (custom pricing): For 1,000+ articles/day or real-time feeds with hourly scrapes on major outlets.
Cost optimization:
- Prioritize high-traffic sources with hourly scrapes; scrape niche sources 2–3 times weekly
- Cache URL lists for 7 days instead of re-mapping daily (saves ~43 credits/domain/day)
- Use HTTP-only rendering (cheaper) for news sites; reserve Chrome rendering for complex layouts
- Deduplicate aggressively to avoid re-extracting duplicate content
FAQ
Q: How do I keep my aggregation feed fresh without scraping every source constantly?
A: Implement tiered refresh rates. Scrape breaking-news sources (Reuters, AP, BBC) every 15–30 minutes. Mid-tier sources every 2–4 hours. Archive and niche sources daily or weekly. Prioritize by traffic and publish frequency.
Q: Can I legally republish news content I scrape?
A: It depends. If you're aggregating and attributing with links (like Google News), most publishers permit it under fair use. If you're republishing full text without permission, you risk DMCA takedowns. Always check robots.txt and Terms of Service. Many news sites explicitly permit aggregation with attribution.
Q: How do I detect breaking news automatically?
A: Monitor article publish timestamps and headline patterns. If 5+ sources publish similar headlines within 30 minutes, or if article count spikes unexpectedly, it's likely breaking news. Implement email/SMS alerts for specific keywords (company names, event types) to notify users immediately.
Q: Should I use RSS feeds or scraping?
A: Use both. RSS is fast and cheap for sources that maintain it (Reuters, BBC, NY Times offer RSS). Scrape sources that lack RSS or where RSS summaries are insufficient. fastCRW covers the gaps that feeds leave behind.
Q: How do I handle sites with complex JavaScript rendering?
A: Use fastCRW's LightPanda or Chrome rendering modes. They're more expensive per scrape (~$0.001–0.005 extra) but ensure you capture dynamically loaded content. For news sites, HTTP scraping usually suffices; reserve JS rendering for tech sites with heavy client-side rendering.
Q: How long does it take to scrape 1,000 articles?
A: At ~2 seconds per article average, 1,000 articles takes ~33 minutes with serial scraping. With fastCRW's batch endpoint or parallel requests (10 concurrent), you can do it in 3–5 minutes. Use /v1/batch/scrape when available, or parallelize manually via your task queue.
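Parallelizing manually is a few lines with a thread pool; here `scrape_fn` stands in for whatever per-URL scrape call you use:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls: list[str], scrape_fn, max_workers: int = 10) -> list[dict]:
    """Run scrape_fn over urls concurrently, dropping failures (None results).

    Results come back in input order because Executor.map preserves ordering.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(scrape_fn, urls))
    return [r for r in results if r]
```

Threads work well here because the workload is I/O-bound (waiting on HTTP responses), so the GIL is not a bottleneck.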
Q: Can I get images and videos from articles?
A: fastCRW returns images in markdown format (embedded links). For video metadata, use extraction to pull video URLs from the page. Store image/video URLs in your article record; serve them via CDN to users.
Q: What's the difference between fastCRW news aggregation and Google News?
A: Google News is a curated, ranked feed with editorial features (local coverage, personalization, trending topics). fastCRW is a raw scraping/API layer that lets you build your own aggregation product. You control sources, ranking logic, and feature set.
Related resources
- Firecrawl alternatives — the managed scraping API many news pipelines start on before scaling self-host
- Jina Reader alternatives — markdown-extraction comparison for article body cleanup
- LangChain integration — chain scrape + summarize for downstream news summarization
- Langflow integration — same flow, visual editor
- Content aggregation — the broader pattern for any high-volume site corpus
- Brand monitoring — filter the same article stream for your brand mentions
- RAG pipelines — pipe articles into a vector store for retrieval-augmented chat