What Is Web Scraping?
Web scraping is the process of automatically extracting data from websites. Instead of manually copying and pasting information, you write code (or use a tool) that visits web pages, reads the content, and saves what you need in a structured format.
Think of it this way: when you visit a webpage, your browser downloads HTML, CSS, and JavaScript, then renders it into the page you see. Web scraping does the same thing — downloads the page — but instead of rendering it visually, it extracts the text, links, images, or data you're interested in.
What Can You Do with Web Scraping?
- Research: Collect data for market research, academic studies, or competitive analysis
- Price monitoring: Track product prices across e-commerce sites
- Content aggregation: Build news feeds or knowledge bases from multiple sources
- Lead generation: Gather business contact information from directories
- AI and machine learning: Collect training data or build knowledge bases for RAG (Retrieval-Augmented Generation) systems
- SEO analysis: Audit website content, meta tags, and link structures
How Web Scraping Works
Every web scraping process follows the same basic steps:
- Request: Send an HTTP request to a URL (just like your browser does when you type a URL)
- Receive: Get back the HTML content of the page
- Parse: Read through the HTML to find the data you want
- Extract: Pull out specific pieces of information (text, links, prices, etc.)
- Store: Save the data somewhere useful (spreadsheet, database, file)
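The five steps above can be sketched in Python. This toy example skips the network request and parses a hardcoded HTML snippet with the standard library's `html.parser`, so it runs offline; the product markup is made up for illustration:

```python
# A minimal sketch of the parse/extract/store steps, standard library only.
# In practice the HTML would come from an HTTP request (urllib or requests).
import csv
import io
from html.parser import HTMLParser

html = """
<html><body>
  <h1>Example Product</h1>
  <span class="price">$19.99</span>
  <a href="/reviews">Reviews</a>
</body></html>
"""

class PriceExtractor(HTMLParser):
    """Parse step: walk the HTML and remember text inside the price span."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())  # extract step

parser = PriceExtractor()
parser.feed(html)

# Store step: write the extracted data as CSV (in-memory here,
# a real scraper would write to a file or database).
out = io.StringIO()
csv.writer(out).writerow(["price", parser.prices[0]])
print(parser.prices[0])  # $19.99
```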
Static vs. Dynamic Websites
This is an important distinction for scraping:
- Static websites: All content is in the HTML that the server sends. Simple to scrape — just download the HTML and parse it.
- Dynamic websites (SPAs): Content is loaded by JavaScript after the page loads. Harder to scrape — you either need a browser engine that executes JavaScript, or you need a tool that handles this for you.
Most modern websites are at least partially dynamic. Social media feeds, e-commerce product pages, and dashboards almost always use JavaScript to load content. This is why simple HTTP requests often return empty or incomplete results — the JavaScript that populates the page hasn't run.
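You can often spot a JavaScript-rendered page by how little visible text the raw HTML contains. Here is a rough, illustrative heuristic in Python (both HTML snippets below are made up):

```python
# A rough heuristic: strip scripts and tags, then count the remaining words.
# A near-zero count suggests the content is loaded by JavaScript after the fact.
import re

def visible_text_length(html: str) -> int:
    """Remove scripts and tags, then return a rough word count."""
    html = re.sub(r"<script.*?</script>", " ", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", html)
    return len(text.split())

static_page = "<html><body><h1>Docs</h1><p>Full article text lives here.</p></body></html>"
spa_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(visible_text_length(static_page))  # several words
print(visible_text_length(spa_shell))    # 0 -- content arrives via JavaScript
```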
Legal and Ethical Considerations
Before scraping any website, understand the rules:
What's Generally Acceptable
- Scraping publicly available data (no login required)
- Respecting the site's robots.txt file (tells crawlers which pages to avoid)
- Scraping at a reasonable rate (don't overload the server)
- Using the data for personal research, non-commercial purposes, or with permission
What to Be Careful About
- Terms of Service: Many sites explicitly prohibit scraping. Violating ToS can have legal consequences.
- Personal data: Scraping personal information (names, emails, photos) may violate privacy laws like GDPR or CCPA.
- Copyright: The content on websites is typically copyrighted. Scraping for republication can be infringement.
- Rate limiting: Sending too many requests can be considered a denial-of-service attack.
Best Practices
- Always check the site's robots.txt (e.g., https://example.com/robots.txt)
- Add delays between requests (1–2 seconds minimum)
- Identify your scraper with a descriptive User-Agent string
- Prefer official APIs when available — many sites offer APIs specifically for data access
- If in doubt, ask permission
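These practices combine into a small Python sketch. The robots.txt body is hardcoded so the example runs offline (in real code you would fetch it from the site), and the bot name and contact address are placeholders:

```python
# A polite-scraper sketch: honor robots.txt, identify yourself with a
# descriptive User-Agent, and pause between requests.
import time
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

headers = {"User-Agent": "my-research-bot/1.0 (contact: you@example.com)"}

for path in ["https://example.com/public", "https://example.com/private/data"]:
    if rp.can_fetch(headers["User-Agent"], path):
        print(f"OK to fetch: {path}")
        # ... send the request with `headers` here ...
        time.sleep(1.5)  # keep a 1-2 second delay between requests
    else:
        print(f"Disallowed by robots.txt: {path}")
```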
Web Scraping Tools Overview
There are many ways to scrape the web. Here's a quick overview of the main categories:
1. Code Libraries
Write code in your preferred language to fetch and parse web pages:
- Python: requests + Beautiful Soup, Scrapy
- JavaScript: Cheerio, Puppeteer, Playwright
- Go: Colly
Best for developers who want full control. Requires maintaining selectors and handling edge cases manually.
2. Browser Automation
Control a real browser programmatically to handle JavaScript-heavy sites:
- Puppeteer: Chrome/Chromium automation (JavaScript)
- Playwright: Cross-browser automation (JavaScript, Python, Java, .NET)
- Selenium: The veteran — supports many browsers and languages
Best for dynamic sites that require interaction (clicking buttons, scrolling, filling forms). Heavy and slow compared to HTTP-based approaches.
3. No-Code Tools
Visual interfaces for building scrapers without coding:
- Make.com: Visual workflow automation with HTTP modules
- n8n: Open-source workflow automation
- Apify: Cloud scraping platform with pre-built actors
Best for non-developers or quick prototypes. Limited flexibility compared to code.
4. Scraping APIs
Send a URL, get clean data back — the API handles rendering, parsing, and cleaning:
- CRW / fastCRW: Open-source, Firecrawl-compatible, returns clean markdown
- Firecrawl: Commercial scraping API
- ScrapingBee, ScraperAPI: Proxy-based scraping services
Best for developers who want clean output without managing scraping infrastructure. The approach we'll focus on in this guide.
Hands-On: Your First Scrape with CRW
Let's get hands-on. We'll use CRW — an open-source web scraping API that returns clean markdown from any URL. You can self-host it or use the cloud version.
Option A: Self-Host CRW (Free)
Run CRW locally with Docker:
```bash
docker run -p 3000:3000 ghcr.io/us/crw:latest
```
CRW is now running at http://localhost:3000. That's it — no configuration, no API key needed for local use.
Option B: Use fastCRW Cloud
Sign up at fastcrw.com and get an API key. No infrastructure to manage.
Scrape Your First Page
Let's scrape a page using curl (available on every operating system):
```bash
# Self-hosted (no API key needed)
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

# fastCRW Cloud
curl -X POST https://fastcrw.com/api/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'
```
The response contains clean markdown — all the navigation bars, cookie banners, ads, and footers stripped away:
```json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com",
      "description": "..."
    }
  }
}
```
Scrape with Python
```python
import requests

# Change to http://localhost:3000 for self-hosted
CRW_URL = "https://fastcrw.com/api"

response = requests.post(
    f"{CRW_URL}/v1/scrape",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer fc-YOUR_API_KEY",
    },
    json={
        "url": "https://example.com",
        "formats": ["markdown"],
    },
)

data = response.json()
if data["success"]:
    print(f"Title: {data['data']['metadata']['title']}")
    print(f"Content:\n{data['data']['markdown']}")
else:
    print(f"Error: {data.get('error')}")
```
Scrape with JavaScript
```javascript
const response = await fetch("https://fastcrw.com/api/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer fc-YOUR_API_KEY",
  },
  body: JSON.stringify({
    url: "https://example.com",
    formats: ["markdown"],
  }),
});

const data = await response.json();
if (data.success) {
  console.log(`Title: ${data.data.metadata.title}`);
  console.log(`Content: ${data.data.markdown}`);
}
```
Going Further: Crawl an Entire Website
Scraping one page at a time is useful, but often you need content from an entire site. CRW's /v1/crawl endpoint handles this:
```bash
# Start a crawl (async — returns immediately with a job ID)
curl -X POST https://fastcrw.com/api/v1/crawl \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 20,
    "scrapeOptions": { "formats": ["markdown"] }
  }'

# Response: {"success": true, "id": "crawl-abc123"}

# Check the status (repeat until status is "completed")
curl https://fastcrw.com/api/v1/crawl/crawl-abc123 \
  -H "Authorization: Bearer fc-YOUR_API_KEY"
```
When the crawl completes, you get an array of all pages with their markdown content. CRW automatically discovers linked pages and scrapes them — you don't need to find URLs manually.
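The start-then-poll pattern looks like this in Python. The polling helper takes an injected `get_status` function so the sketch runs offline; in real code that function would GET the crawl status URL and return the parsed JSON (the `status` and `data` field names are assumed from the response examples in this guide):

```python
# A sketch of the poll-until-complete loop for an async crawl job.
import time

def wait_for_crawl(get_status, job_id, interval=2.0, timeout=300.0):
    """Poll the crawl job until it reports 'completed', then return its pages."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status(job_id)
        if job.get("status") == "completed":
            return job.get("data", [])
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish within {timeout}s")

# Offline demo: a fake status function that completes on the third poll.
calls = {"n": 0}
def fake_status(job_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"status": "scraping"}
    return {"status": "completed", "data": [{"markdown": "# Page 1"}]}

pages = wait_for_crawl(fake_status, "crawl-abc123", interval=0.01)
print(len(pages))  # 1
```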
Discover Pages Before Scraping
Sometimes you want to see what pages exist before deciding what to scrape. The /v1/map endpoint does exactly this:
```bash
curl -X POST https://fastcrw.com/api/v1/map \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{"url": "https://docs.example.com"}'

# Response: {"success": true, "links": ["https://docs.example.com/intro", "https://docs.example.com/api", ...]}
```
Map returns URLs without fetching content — much faster than a full crawl. Use it to preview a site's structure, filter to relevant pages, then scrape only what you need.
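A typical map-then-filter workflow in Python might look like this; the link list below stands in for a real map response:

```python
# Filter a mapped URL list down to the pages worth scraping.
links = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api/scrape",
    "https://docs.example.com/api/crawl",
    "https://docs.example.com/blog/announcement",
]

# Keep only the API reference pages, then scrape just those.
api_pages = [url for url in links if "/api/" in url]
print(api_pages)
```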
Extract Structured Data
Getting raw text is a great start, but sometimes you need structured data — product names, prices, dates, contact information. CRW's /v1/extract endpoint uses AI to pull structured data from pages:
```bash
curl -X POST https://fastcrw.com/api/v1/extract \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/product"],
    "prompt": "Extract the product name, price, and whether it is in stock",
    "schema": {
      "type": "object",
      "properties": {
        "product_name": { "type": "string" },
        "price": { "type": "string" },
        "in_stock": { "type": "boolean" }
      }
    }
  }'
```
You describe what you want in plain English, optionally provide a JSON schema, and CRW returns structured data. No CSS selectors, no XPath, no regex.
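Because AI extraction can occasionally return malformed fields, it's worth checking the result against the schema you sent. A minimal, illustrative Python check (the response dict below is made up):

```python
# Verify that an extract response has the fields and types the schema asked for.
schema_types = {"product_name": str, "price": str, "in_stock": bool}

extracted = {"product_name": "Widget Pro", "price": "$49.00", "in_stock": True}

def validate(data, types):
    """Return a list of fields that are missing or have the wrong type."""
    return [k for k, t in types.items() if not isinstance(data.get(k), t)]

problems = validate(extracted, schema_types)
print(problems)  # an empty list means the response matches the schema
```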
Common Beginner Mistakes
1. Scraping Too Fast
Sending hundreds of requests per second will get your IP blocked and potentially cause issues for the website. Always add delays between requests. When using CRW's crawl endpoint, the rate limiting is handled for you.
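A fixed delay between requests is easy to add in plain Python. This sketch uses a stub fetch function and a shortened delay so it runs instantly; use 1–2 seconds against real sites:

```python
# Fetch a list of URLs with a pause between each request.
import time

def scrape_politely(urls, fetch, delay=1.0):
    """Call `fetch` for each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request after the first
        results.append(fetch(url))
    return results

# Offline demo with a stub fetch function and a tiny delay.
start = time.monotonic()
out = scrape_politely(["u1", "u2", "u3"], fetch=lambda u: f"scraped {u}", delay=0.05)
elapsed = time.monotonic() - start
print(out, f"{elapsed:.2f}s")
```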
2. Not Handling Errors
Websites go down, pages move, and structures change. Always check for errors in your scraping code:
```python
# Bad: assumes everything works
data = requests.post(url, json=body).json()
content = data["data"]["markdown"]

# Good: handles failures
response = requests.post(url, json=body, timeout=30)
data = response.json()
if data.get("success"):
    content = data["data"]["markdown"]
else:
    print(f"Failed: {data.get('error', 'Unknown error')}")
```
3. Ignoring robots.txt
Check https://example.com/robots.txt before scraping. It tells you which pages the site owner prefers you don't access.
4. Storing Raw HTML Instead of Clean Content
Raw HTML is full of navigation, scripts, ads, and boilerplate. For most use cases (AI, search, analysis), you want clean text. CRW's markdown output solves this — it strips non-content elements automatically.
5. Not Using an API When One Exists
Many websites offer official APIs for their data. Check for an API first — it's more reliable, structured, and usually legal. Scraping is for when there's no better option.
Why CRW for Beginners?
CRW simplifies web scraping to its essentials: give it a URL, get clean content back. Here's why it's ideal for beginners:
- No selectors to learn: You don't need to know CSS selectors, XPath, or DOM traversal. CRW extracts the content automatically.
- No browser setup: No installing Chrome, managing drivers, or configuring headless mode. CRW handles JavaScript rendering server-side.
- Clean output: Markdown is readable and useful immediately. No HTML parsing or text cleaning required.
- Simple API: One HTTP POST request. Works from any language, any platform.
- Open source: Run it locally for free, inspect the code, no vendor lock-in.
- Fast: 833ms average latency. CRW is built in Rust and uses only 6.6 MB of RAM idle.
What to Build Next
Now that you know the basics, here are some beginner-friendly projects to practice:
- News aggregator: Scrape 3–5 news sites daily, save articles as markdown files, build a simple reading list
- Price tracker: Monitor a product page, extract the price, log it to a CSV, and chart the price over time
- Documentation search: Crawl a documentation site, index the content, and build a simple search tool
- AI knowledge base: Scrape a topic you're studying, feed the markdown into a RAG pipeline, and chat with the content
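As a taste of the price-tracker project, here's a tiny Python sketch that appends prices to a CSV log and reads the history back (the dates and prices are made up):

```python
# Append each day's scraped price to a CSV file, then read the history back.
import csv
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / "prices.csv"

def log_price(path, date, price):
    """Append one (date, price) row to the CSV log."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date, price])

log_price(log_path, "2024-01-01", "19.99")
log_price(log_path, "2024-01-02", "17.49")

with open(log_path, newline="") as f:
    history = list(csv.reader(f))
print(history)  # [['2024-01-01', '19.99'], ['2024-01-02', '17.49']]
```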
Self-Hosted vs. Cloud
| | Self-Hosted CRW | fastCRW Cloud |
|---|---|---|
| Setup | docker run -p 3000:3000 ghcr.io/us/crw:latest | Sign up at fastcrw.com |
| Cost | Free | Pay per request |
| API Key | Not needed locally | Required |
| Best for | Learning, high volume | Quick start, no Docker needed |
Conclusion
Web scraping doesn't have to be complicated. With CRW, you can go from zero to extracting clean content from any website with a single API call. Start with the basics — scrape a page, see the markdown output, try crawling a small site — and build from there.
For more advanced topics, explore our Python web scraping guide, JavaScript scraping guide, or learn how to build a RAG pipeline with scraped content.
Ready to start? Self-host CRW for free or sign up for fastCRW cloud.