Short Answer
For AI agents and RAG pipelines, fastCRW (Rust) is the most production-ready open-source web scraping library in 2026: a single small static binary, no browser baseline, a built-in MCP server, a Firecrawl-compatible REST API, and 63.74% truth-recall on Firecrawl's public 1,000-URL scrape-content dataset (819 labeled URLs, diagnose_3way.py, 2026-05-08). For Python-native research and ML pipelines, Crawl4AI is the right pick. For classic multi-page crawls in Python, Scrapy still wins. The full 8-library breakdown follows.
What Counts as a "Web Scraping Library" in 2026?
In 2026 the term covers a wider stack than it did five years ago. We include:
- HTML parsers — BeautifulSoup, lol-html (the Rust parser fastCRW uses internally).
- HTTP-first crawling frameworks — Scrapy, Colly.
- Browser-automation libraries used for scraping — Playwright, Puppeteer, Selenium.
- AI-shaped scraping libraries — Crawl4AI (Python framework) and fastCRW (Rust service with REST/MCP surface).
If your only goal is "fetch this static HTML page and pull out three fields", you do not need a browser. If your only goal is "feed an AI agent or vector store clean markdown", you do not need to assemble five Python deps. The right library is the smallest one that solves your actual problem.
Comparison Table
| Library | Language | License | Browser? | MCP / AI surface | Primary use case |
|---|---|---|---|---|---|
| fastCRW | Rust | AGPL-3.0 | No (LightPanda fallback) | ✅ Built-in MCP + Firecrawl-compatible REST | AI agents, RAG, lightweight self-host |
| Scrapy | Python | BSD-3 | No | ❌ | Multi-page crawls with pipelines & throttling |
| BeautifulSoup | Python | MIT | No (parser only) | ❌ | HTML parsing inside scripts |
| Playwright | Node / Python / .NET / Java | Apache-2.0 | Yes (Chromium, Firefox, WebKit) | ❌ | JS-heavy SPAs, auth flows |
| Puppeteer | Node (TypeScript) | Apache-2.0 | Yes (Chromium) | ❌ | Chromium-only browser automation |
| Selenium | Python / Java / JS / C# / Ruby | Apache-2.0 | Yes (WebDriver) | ❌ | Legacy & cross-browser test/scrape |
| Crawl4AI | Python | Apache-2.0 | Yes (Playwright/Chromium) | Community add-ons | Python-native AI extraction |
| Colly | Go | Apache-2.0 | No | ❌ | Fast Go-native HTTP crawling |
Detailed Reviews
1. fastCRW (Rust)
Repository: github.com/us/crw · Language: Rust · License: AGPL-3.0 (commercial license available)
fastCRW is a Rust-native web scraping engine that ships as a single small static binary and exposes the Firecrawl REST surface (/v1/scrape, /v1/crawl, /v1/map, /v1/extract, /v1/search) plus a built-in MCP server. Internally it uses lol-html (a streaming Rust HTML parser) on HTML-primary pages and falls back to LightPanda only when JavaScript rendering is required — so there is no headless-browser memory baseline.
Primary use case: AI agents and RAG pipelines that need clean markdown, MCP, and a small operational footprint. Also the right pick for any team that wants a Firecrawl-compatible API they can self-host on a $5 VPS.
Headline accuracy number: 63.74% truth-recall on Firecrawl's public 1,000-URL scrape-content dataset (819 labeled URLs, diagnose_3way.py, 2026-05-08). Full reproducible script and latency distribution on /benchmarks.
Quickstart:
docker run -p 3000:3000 ghcr.io/us/crw:latest
curl http://localhost:3000/v1/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}'
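The same call can be made from Python with only the standard library. This is a minimal sketch, assuming a fastCRW instance is listening on localhost:3000 as in the docker command above. The helper names (`build_scrape_request`, `extract_markdown`, `scrape`) are illustrative, and the response shape is assumed to follow the Firecrawl v1 convention of `{"success": ..., "data": {"markdown": ...}}`; check it against your own instance before relying on it.

```python
import json
import urllib.request

# Assumption: a fastCRW instance is listening locally, as in the docker quickstart.
BASE_URL = "http://localhost:3000"


def build_scrape_request(url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST that the curl quickstart sends to /v1/scrape."""
    payload = {"url": url, "formats": ["markdown"]}
    return urllib.request.Request(
        f"{base_url}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_markdown(body: dict) -> str:
    """Pull the markdown out of a decoded response body.

    Assumption: the body follows the Firecrawl v1 shape
    {"success": true, "data": {"markdown": "..."}}.
    """
    if not body.get("success"):
        raise RuntimeError(f"scrape failed: {body.get('error', 'unknown error')}")
    return body["data"]["markdown"]


def scrape(url: str, timeout: float = 30.0) -> str:
    """Send the request and return the page as markdown (needs a running server)."""
    with urllib.request.urlopen(build_scrape_request(url), timeout=timeout) as resp:
        return extract_markdown(json.load(resp))
```

Keeping request construction and response parsing separate from the network call makes both easy to unit-test without a server running.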
Limitations: No screenshot or PDF/DOCX parsing yet (both on the roadmap). Complex React/Vue SPAs may be more reliably handled by Playwright today. AGPL-3.0 has copyleft implications for embedded commercial use: calling the API from a closed-source product is fine, but modifying the engine and then redistributing it or serving it to users over a network triggers source-sharing obligations. A commercial license is available.
2. Scrapy (Python)
Repository: github.com/scrapy/scrapy · Language: Python · License: BSD-3
Scrapy is the most established Python crawling framework, in continuous development since 2008. Its model is spiders that yield items into pipelines, with middleware handling retries, throttling, robots.txt, and proxies. For a crawler with real pagination logic, login flows, and per-domain rate limits, Scrapy still has the deepest tooling.
Primary use case: Multi-page crawls where you need fine-grained control over the crawl graph, request throttling, and output pipelines. Classic "scrape all product pages on an e-commerce site, dedupe, and write to Postgres" jobs are a perfect fit.
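To make "control over the crawl graph" concrete, here is a stdlib-only sketch of the bookkeeping a crawling framework does for you: URL deduplication, a depth limit, and a same-host filter. It is not Scrapy's implementation. `crawl` and `fetch_links` are hypothetical names, and the "site" is an in-memory dictionary so the example runs without network access. Throttling and retries are deliberately left out.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> list[str]:
    """Breadth-first crawl with URL dedup, a depth limit, and a same-host filter.

    fetch_links is a hypothetical callback: given a URL, it returns the hrefs
    found on that page (absolute or relative).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []

    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            absolute, _ = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc != host or absolute in seen:
                continue
            seen.add(absolute)
            queue.append((absolute, depth + 1))
    return visited


# Usage with an in-memory "site" instead of real HTTP requests.
SITE = {
    "https://shop.example/": ["/p/1", "/p/2", "https://other.example/"],
    "https://shop.example/p/1": ["/p/2", "/p/3#reviews"],
    "https://shop.example/p/2": ["/"],
    "https://shop.example/p/3": [],
}
pages = crawl("https://shop.example/", lambda u: SITE.get(u, []), max_depth=1)
print(pages)
```

Once per-domain delays, retries, and robots.txt handling are added on top of this, the result starts to resemble a framework, which is the argument for reaching for Scrapy on larger jobs.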
Quickstart:
pip install scrapy
scrapy startproject myproject
cd myproject
scrapy genspider example example.com
scrapy crawl example
Limitations: Scrapy outputs items or raw HTML — there is no native markdown conversion, so you bolt that on yourself for AI pipelines. No MCP. The framework opinions (spiders, items, pipelines, middleware) carry a learning curve that is overkill for a one-off scrape.
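Bolting markdown conversion onto raw HTML can start small. The sketch below is a stdlib-only converter that handles headings, paragraphs, links, and list items. `MarkdownSketch` and `html_to_markdown` are illustrative names, not part of Scrapy, and the coverage is intentionally narrow (no tables, code blocks, or nested lists). A dedicated conversion library is likely the better choice for production use.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-markdown converter: headings, paragraphs, links, list items."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[str] = []   # finished markdown blocks
        self.current: list[str] = []  # text fragments of the open block
        self.prefix = ""              # markdown prefix for the open block
        self.href = None              # target of the <a> being read, if any
        self.skip = 0                 # nesting depth inside <script>/<style>

    def _flush(self) -> None:
        # Collapse whitespace and emit the open block, if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in self.HEADINGS or tag in ("p", "li"):
            self._flush()
            self.prefix = self.HEADINGS.get(tag, "- " if tag == "li" else "")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.current.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag in self.HEADINGS or tag in ("p", "li"):
            self._flush()
        elif tag == "a":
            self.current.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip:  # ignore CSS and JavaScript source
            self.current.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return "\n\n".join(parser.blocks)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>Read <a href='/docs'>the docs</a> now.</p>"
    "<ul><li>One</li><li>Two</li></ul></body></html>"
)
print(html_to_markdown(sample))
```

In a Scrapy project, a function like this would typically run inside an item pipeline, after the spider has yielded the raw HTML.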
3. BeautifulSoup (Python)
Repository: crummy.com/software/BeautifulSoup · Language: Python · License: MIT
BeautifulSoup (BS4) is the canonical Python HTML/XML parser. It is not a crawler — you pair it with requests (or httpx) to actually fetch the page. The selector API is forgiving and the tree-walking model is straightforward.
Primary use case: Ad-hoc parsing inside Python scripts and notebooks. If your job is "fetch one page and pull three fields", BeautifulSoup + requests is hard to beat — total footprint is two pip packages and ~5 MB of memory.
Quickstart:
pip install beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)
Limitations: Single-page parser, no crawl orchestration, no rate limiting, no JS rendering. For anything beyond a few hundred pages you'll outgrow it into Scrapy or fastCRW.
4. Playwright (Node / Python / .NET / Java)
Repository: github.com/microsoft/playwright · Language: Node, Python, .NET, Java · License: Apache-2.0
Playwright (Microsoft, 2020) is the modern browser automation library. It drives Chromium, Firefox, and WebKit through a single API, with auto-wait, network interception, persistent contexts, and excellent debugging tools. For scraping it is the de-facto choice when you genuinely need a browser.
Primary use case: JS-rendered SPAs, sites behind authentication or complex client-side state, and any workflow that needs to click, type, and follow real user flows. Also widely used as the browser layer inside higher-level scrapers (Firecrawl and Crawl4AI both depend on Playwright).
Quickstart (Python):
pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless by default
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.content())  # serialized HTML of the loaded page
    browser.close()
Limitations: A real Chromium process per worker — ~200–300 MB per browser context. Cold starts are slow. Browser flakes (timeouts, navigation aborts) become your problem. Overkill for HTML-primary content where a streaming parser would be 10–50x faster.
5. Puppeteer (Node)
Repository: github.com/puppeteer/puppeteer · Language: Node (TypeScript) · License: Apache-2.0
Puppeteer (Google, 2017) is the predecessor to Playwright. It drives Chromium via the DevTools Protocol and exposes a similar promise-based (async/await) API. Its ecosystem in Node is mature, and puppeteer-extra-plugin-stealth remains a popular bot-detection-evasion plugin for scrapers.
Primary use case: Chromium-only browser automation inside Node apps where you want a single, well-known dependency. Many older scraping pipelines are still on Puppeteer and have not migrated to Playwright simply because Puppeteer works.
Quickstart:
npm install puppeteer
import puppeteer from "puppeteer";
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
console.log(await page.content());
await browser.close();
Limitations: Node-only as a first-class API. Chromium-only (no Firefox/WebKit parity). For new projects Playwright has more feature surface, better cross-browser story, and stronger Microsoft-backed maintenance.
6. Selenium (Multi-language)
Repository: github.com/SeleniumHQ/selenium · Language: Python, Java, JavaScript, C#, Ruby · License: Apache-2.0
Selenium (2004) is the WebDriver-protocol grandfather of browser automation. It still has the broadest language coverage of any browser automation library and ships official bindings in five languages. Selenium Grid lets you parallelize across nodes for large test/scrape farms.
Primary use case: Cross-language WebDriver workflows, legacy infrastructure that already runs Selenium Grid, or QA-and-scrape hybrid pipelines. For greenfield scraping projects in Python or Node, Playwright is the modern replacement.
Quickstart (Python):
pip install selenium webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")
print(driver.page_source)
driver.quit()
Limitations: The WebDriver protocol is older and chattier than Playwright's wire protocol, which keeps one persistent connection to the browser (CDP in the case of Chromium). Auto-wait is weaker, and explicit waits are easy to get wrong. Each driver session is a separate browser process, so resource use is high.
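The explicit-wait pattern boils down to polling a condition until a deadline passes. The sketch below shows that pattern with the standard library only. It is not Selenium's implementation (Selenium ships its own `WebDriverWait` helper for this), and `wait_until` and `element_ready` are hypothetical names used for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 10.0, interval: float = 0.25) -> T:
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


# Usage: simulate an element that only "renders" on the third check.
attempts = {"n": 0}


def element_ready():
    attempts["n"] += 1
    return "rendered" if attempts["n"] >= 3 else None


print(wait_until(element_ready, timeout=2.0, interval=0.01))
```

The common mistakes are a fixed `sleep` in place of a condition, or a condition that checks for presence when the script actually needs visibility or clickability.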
7. Crawl4AI (Python)
Repository: github.com/unclecode/crawl4ai · Language: Python · License: Apache-2.0
Crawl4AI is a Python library built specifically for AI extraction. It wraps Playwright/Chromium with LLM-friendly chunking strategies, custom extraction hooks, and an async API. For Python ML teams that want everything in-process — scrape, chunk, embed — Crawl4AI removes the HTTP hop a service-based scraper requires.
Primary use case: Python-native AI / RAG pipelines that need deep customization of the extraction step. Research and prototyping where you want chunking, schema extraction, and crawl orchestration in one library.
Quickstart:
pip install crawl4ai
playwright install chromium
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(main())
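Once markdown is available, the chunking step that precedes embedding might look like the sketch below. It is a generic stdlib illustration of heading-aware chunking, not one of Crawl4AI's built-in strategies. `chunk_markdown` and `max_chars` are hypothetical names, and the character budget is a stand-in for whatever token limit your embedding model imposes.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 800) -> list[str]:
    """Split markdown at headings, then pack paragraphs into chunks.

    Chunks stay within max_chars where possible; a single paragraph longer
    than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        buffer = ""
        for para in (p.strip() for p in section.split("\n\n")):
            if not para:
                continue
            candidate = f"{buffer}\n\n{para}" if buffer else para
            if len(candidate) > max_chars and buffer:
                chunks.append(buffer)  # budget exceeded: close the current chunk
                buffer = para
            else:
                buffer = candidate
        if buffer:
            chunks.append(buffer)
    return chunks


doc = (
    "# Intro\n\nFirst paragraph.\n\nSecond paragraph.\n\n"
    "## Details\n\nThird paragraph."
)
for chunk in chunk_markdown(doc, max_chars=30):
    print(repr(chunk))
```

Splitting at headings first keeps each chunk inside one topical section, which tends to matter more for retrieval quality than hitting an exact size.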
Limitations: Python-only. Docker image is ~2 GB (bundles Chromium). 300 MB+ idle RAM. The REST server mode is less mature than the in-process library. No managed cloud option.
8. Colly (Go)
Repository: github.com/gocolly/colly · Language: Go · License: Apache-2.0
Colly is the de-facto Go scraping library. It is a thin layer over net/http with a callback API: register handlers on CSS selectors, call Visit(), and Colly handles URL deduplication, depth limits, throttling, and async queues. For Go teams it's the natural pick.
Primary use case: Go-native scraping services, especially as a sidecar to a larger Go application. Excellent for high-throughput HTTP-only crawls where you want minimal memory and a single static binary.
Quickstart:
go get github.com/gocolly/colly/v2
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Print the text of every <h1> on each visited page.
	c.OnHTML("h1", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
	})

	c.Visit("https://example.com")
}
Limitations: Go-only. No native browser support — JS rendering requires integrating a separate headless tool. No markdown output. No MCP. Smaller AI/LLM ecosystem than Python.
How to Choose
Pick by primary constraint
- You're building an AI agent or a RAG pipeline → fastCRW. MCP server, Firecrawl-compatible REST, clean markdown, no browser baseline.
- You're scraping inside a Python notebook for research → BeautifulSoup + requests for one-offs; Crawl4AI for AI-shaped extraction.
- You're building a multi-page crawler with pipelines → Scrapy. Nearly two decades of crawl middleware that nobody has replicated.
- The target needs a real browser (SPA, auth) → Playwright. Puppeteer if you must stay on Chromium-only Node. Selenium only for legacy or cross-language needs.
- Your stack is Go → Colly for HTTP-only; pair with fastCRW or Playwright if you need browser rendering.
- You need a self-hosted service other languages can call → fastCRW (single binary, REST + MCP).
Footprint tier
- Tiny (single binary, no browser): fastCRW (Rust), Colly (Go).
- Small (parser/library only): BeautifulSoup, Scrapy.
- Heavy (bundles a browser): Playwright, Puppeteer, Selenium, Crawl4AI.
License Cheat-Sheet
| Library | License | Commercial embedding |
|---|---|---|
| fastCRW | AGPL-3.0 | Network use triggers copyleft — commercial license available |
| Scrapy | BSD-3 | ✅ Freely embeddable |
| BeautifulSoup | MIT | ✅ Most permissive |
| Playwright | Apache-2.0 | ✅ Freely embeddable |
| Puppeteer | Apache-2.0 | ✅ Freely embeddable |
| Selenium | Apache-2.0 | ✅ Freely embeddable |
| Crawl4AI | Apache-2.0 | ✅ Freely embeddable |
| Colly | Apache-2.0 | ✅ Freely embeddable |
If AGPL-3.0 is a concern for embedding the fastCRW engine in a closed-source product, note that calling fastCRW's managed API from your code does not trigger copyleft. The obligations attach to modifying the engine and then distributing it or offering it to users over a network.
What's Missing From This List (Intentionally)
- Cheerio — a Node-side jQuery-like HTML parser. Lovely for one-off Node parsing; same role as BeautifulSoup but Node-only.
- Heritrix / Apache Nutch — Java enterprise crawlers, covered in our best open-source web crawlers guide because they're crawlers, not libraries you embed.
- Hosted scraping APIs — Firecrawl, ScrapingBee, Apify, etc. are platforms, not libraries. See Best Web Scraping APIs in 2026.
- Browser automation alternatives — Chromedp (Go), Pyppeteer (Python). Same conceptual category as Playwright/Puppeteer.
Getting Started With fastCRW
Self-host (free, AGPL-3.0)
docker run -p 3000:3000 ghcr.io/us/crw:latest
Single small static binary. Works on the cheapest VPS tier. No Redis, no Playwright, no Python environment. GitHub repo · Documentation.
Hosted via fastCRW
Don't want to manage servers? fastCRW runs the same engine for you. The Free tier includes a one-time allotment of 500 credits (not a monthly meter), then pay-as-you-go. fastcrw.com/pricing is the authoritative list of current tiers.
Further Reading
- Best Open-Source Web Crawlers in 2026 — sister guide for full crawlers (Scrapy, Nutch, Heritrix, fastCRW).
- Best Web Scraping APIs in 2026 — hosted APIs side of the same question.
- Firecrawl vs Crawl4AI vs fastCRW: The Honest Benchmark (2026) — full 3-way numbers and methodology.
- Best self-hosted web scraping tools — operational depth on running these in production.
- /benchmarks — reproducible diagnose_3way.py script and full latency distribution.