What Is Web Scraping?
Web scraping is the process of automatically extracting data from websites. Instead of manually copying and pasting information, you write code (or use a tool) that visits web pages, reads the content, and saves what you need in a structured format.
Think of it this way: when you visit a webpage, your browser downloads HTML, CSS, and JavaScript, then renders it into the page you see. Web scraping does the same thing — downloads the page — but instead of rendering it visually, it extracts the text, links, images, or data you're interested in.
What Can You Do with Web Scraping?
- Research: Collect data for market research, academic studies, or competitive analysis
- Price monitoring: Track product prices across e-commerce sites
- Content aggregation: Build news feeds or knowledge bases from multiple sources
- Lead generation: Gather business contact information from directories
- AI and machine learning: Collect training data or build knowledge bases for RAG (Retrieval-Augmented Generation) systems
- SEO analysis: Audit website content, meta tags, and link structures
How Web Scraping Works
Every web scraping process follows the same basic steps:
- Request: Send an HTTP request to a URL (just like your browser does when you type a URL)
- Receive: Get back the HTML content of the page
- Parse: Read through the HTML to find the data you want
- Extract: Pull out specific pieces of information (text, links, prices, etc.)
- Store: Save the data somewhere useful (spreadsheet, database, file)
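The five steps above can be sketched in Python. This toy example skips the network request and parses a hardcoded HTML snippet with the standard library's `html.parser`, so it runs offline; the product markup is made up for illustration:

```python
# A minimal sketch of the parse/extract/store steps, standard library only.
# In practice the HTML would come from an HTTP request (urllib or requests).
import csv
import io
from html.parser import HTMLParser

html = """
<html><body>
  <h1>Example Product</h1>
  <span class="price">$19.99</span>
  <a href="/reviews">Reviews</a>
</body></html>
"""

class PriceExtractor(HTMLParser):
    """Parse step: walk the HTML and remember text inside the price span."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())  # extract step

parser = PriceExtractor()
parser.feed(html)

# Store step: write the extracted data as CSV (in-memory here,
# a real scraper would write to a file or database).
out = io.StringIO()
csv.writer(out).writerow(["price", parser.prices[0]])
print(parser.prices[0])  # $19.99
```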
Static vs. Dynamic Websites
This is an important distinction for scraping:
- Static websites: All content is in the HTML that the server sends. Simple to scrape — just download the HTML and parse it.
- Dynamic websites (SPAs): Content is loaded by JavaScript after the page loads. Harder to scrape — you either need a browser engine that executes JavaScript, or you need a tool that handles this for you.
Most modern websites are at least partially dynamic. Social media feeds, e-commerce product pages, and dashboards almost always use JavaScript to load content. This is why simple HTTP requests often return empty or incomplete results — the JavaScript that populates the page hasn't run.
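You can often spot a JavaScript-rendered page by how little visible text the raw HTML contains. Here is a rough, illustrative heuristic in Python (both HTML snippets below are made up):

```python
# A rough heuristic: strip scripts and tags, then count the remaining words.
# A near-zero count suggests the content is loaded by JavaScript after the fact.
import re

def visible_text_length(html: str) -> int:
    """Remove scripts and tags, then return a rough word count."""
    html = re.sub(r"<script.*?</script>", " ", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", html)
    return len(text.split())

static_page = "<html><body><h1>Docs</h1><p>Full article text lives here.</p></body></html>"
spa_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(visible_text_length(static_page))  # several words
print(visible_text_length(spa_shell))    # 0 -- content arrives via JavaScript
```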
Legal and Ethical Considerations
Before scraping any website, understand the rules:
What's Generally Acceptable
- Scraping publicly available data (no login required)
- Respecting the site's robots.txt file (tells crawlers which pages to avoid)
- Scraping at a reasonable rate (don't overload the server)
- Using the data for personal research, non-commercial purposes, or with permission
What to Be Careful About
- Terms of Service: Many sites explicitly prohibit scraping. Violating ToS can have legal consequences.
- Personal data: Scraping personal information (names, emails, photos) may violate privacy laws like GDPR or CCPA.
- Copyright: The content on websites is typically copyrighted. Scraping for republication can be infringement.
- Rate limiting: Sending too many requests can be considered a denial-of-service attack.
Best Practices
- Always check the site's robots.txt (e.g., https://example.com/robots.txt)
- Add delays between requests (1–2 seconds minimum)
- Identify your scraper with a descriptive User-Agent string
- Prefer official APIs when available — many sites offer APIs specifically for data access
- If in doubt, ask permission
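These practices combine into a small Python sketch. The robots.txt body is hardcoded so the example runs offline (in real code you would fetch it from the site), and the bot name and contact address are placeholders:

```python
# A polite-scraper sketch: honor robots.txt, identify yourself with a
# descriptive User-Agent, and pause between requests.
import time
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

headers = {"User-Agent": "my-research-bot/1.0 (contact: you@example.com)"}

for path in ["https://example.com/public", "https://example.com/private/data"]:
    if rp.can_fetch(headers["User-Agent"], path):
        print(f"OK to fetch: {path}")
        # ... send the request with `headers` here ...
        time.sleep(1.5)  # keep a 1-2 second delay between requests
    else:
        print(f"Disallowed by robots.txt: {path}")
```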
Web Scraping Tools Overview
There are many ways to scrape the web. Here's a quick overview of the main categories:
1. Code Libraries
Write code in your preferred language to fetch and parse web pages:
- Python: requests + Beautiful Soup, Scrapy
- JavaScript: Cheerio, Puppeteer, Playwright
- Go: Colly
Best for developers who want full control. Requires maintaining selectors and handling edge cases manually.
2. Browser Automation
Control a real browser programmatically to handle JavaScript-heavy sites:
- Puppeteer: Chrome/Chromium automation (JavaScript)
- Playwright: Cross-browser automation (JavaScript, Python, Java, .NET)
- Selenium: The veteran — supports many browsers and languages
Best for dynamic sites that require interaction (clicking buttons, scrolling, filling forms). Heavy and slow compared to HTTP-based approaches.
3. No-Code Tools
Visual interfaces for building scrapers without coding:
- Make.com: Visual workflow automation with HTTP modules
- n8n: Open-source workflow automation
- Apify: Cloud scraping platform with pre-built actors
Best for non-developers or quick prototypes. Limited flexibility compared to code.
4. Scraping APIs
Send a URL, get clean data back — the API handles rendering, parsing, and cleaning:
- CRW / fastCRW: Open-source, Firecrawl-compatible, returns clean markdown
- Firecrawl: Commercial scraping API
- ScrapingBee, ScraperAPI: Proxy-based scraping services
Best for developers who want clean output without managing scraping infrastructure. The approach we'll focus on in this guide.
Hands-On: Your First Scrape with CRW
Let's get hands-on. We'll use CRW — an open-source web scraping API that returns clean markdown from any URL. You can self-host it or use the cloud version.
Option A: Self-Host CRW (Free)
Run CRW locally with Docker:
```bash
docker run -p 3000:3000 ghcr.io/us/crw:latest
```
CRW is now running at http://localhost:3000. That's it — no configuration, no API key needed for local use.
Option B: Use fastCRW Cloud
Sign up at fastcrw.com and get an API key. No infrastructure to manage.
Scrape Your First Page
Let's scrape a page using curl (available on every operating system):
```bash
# Self-hosted (no API key needed)
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

# fastCRW Cloud
curl -X POST https://fastcrw.com/api/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'
```
The response contains clean markdown — all the navigation bars, cookie banners, ads, and footers stripped away:
```json
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com",
      "description": "..."
    }
  }
}
```
Scrape with Python
```python
import requests

# Change to http://localhost:3000 for self-hosted
CRW_URL = "https://fastcrw.com/api"

response = requests.post(
    f"{CRW_URL}/v1/scrape",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer fc-YOUR_API_KEY",
    },
    json={
        "url": "https://example.com",
        "formats": ["markdown"],
    },
)

data = response.json()
if data["success"]:
    print(f"Title: {data['data']['metadata']['title']}")
    print(f"Content:\n{data['data']['markdown']}")
else:
    print(f"Error: {data.get('error')}")
```
Scrape with JavaScript
```javascript
const response = await fetch("https://fastcrw.com/api/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer fc-YOUR_API_KEY",
  },
  body: JSON.stringify({
    url: "https://example.com",
    formats: ["markdown"],
  }),
});

const data = await response.json();
if (data.success) {
  console.log(`Title: ${data.data.metadata.title}`);
  console.log(`Content: ${data.data.markdown}`);
}
```
Going Further: Crawl an Entire Website
Scraping one page at a time is useful, but often you need content from an entire site. CRW's /v1/crawl endpoint handles this:
```bash
# Start a crawl (async — returns immediately with a job ID)
curl -X POST https://fastcrw.com/api/v1/crawl \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 20,
    "scrapeOptions": { "formats": ["markdown"] }
  }'

# Response: {"success": true, "id": "crawl-abc123"}

# Check the status (repeat until status is "completed")
curl https://fastcrw.com/api/v1/crawl/crawl-abc123 \
  -H "Authorization: Bearer fc-YOUR_API_KEY"
```
When the crawl completes, you get an array of all pages with their markdown content. CRW automatically discovers linked pages and scrapes them — you don't need to find URLs manually.
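The start-then-poll pattern looks like this in Python. The polling helper takes an injected `get_status` function so the sketch runs offline; in real code that function would GET the crawl status URL and return the parsed JSON (the `status` and `data` field names are assumed from the response examples in this guide):

```python
# A sketch of the poll-until-complete loop for an async crawl job.
import time

def wait_for_crawl(get_status, job_id, interval=2.0, timeout=300.0):
    """Poll the crawl job until it reports 'completed', then return its pages."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status(job_id)
        if job.get("status") == "completed":
            return job.get("data", [])
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish within {timeout}s")

# Offline demo: a fake status function that completes on the third poll.
calls = {"n": 0}
def fake_status(job_id):
    calls["n"] += 1
    if calls["n"] < 3:
        return {"status": "scraping"}
    return {"status": "completed", "data": [{"markdown": "# Page 1"}]}

pages = wait_for_crawl(fake_status, "crawl-abc123", interval=0.01)
print(len(pages))  # 1
```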
Discover Pages Before Scraping
Sometimes you want to see what pages exist before deciding what to scrape. The /v1/map endpoint does exactly this:
```bash
curl -X POST https://fastcrw.com/api/v1/map \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{"url": "https://docs.example.com"}'

# Response: {"success": true, "links": ["https://docs.example.com/intro", "https://docs.example.com/api", ...]}
```
Map returns URLs without fetching content — much faster than a full crawl. Use it to preview a site's structure, filter to relevant pages, then scrape only what you need.
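A typical map-then-filter workflow in Python might look like this; the link list below stands in for a real map response:

```python
# Filter a mapped URL list down to the pages worth scraping.
links = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api/scrape",
    "https://docs.example.com/api/crawl",
    "https://docs.example.com/blog/announcement",
]

# Keep only the API reference pages, then scrape just those.
api_pages = [url for url in links if "/api/" in url]
print(api_pages)
```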
Extract Structured Data
Getting raw text is a great start, but sometimes you need structured data — product names, prices, dates, contact information. CRW's /v1/extract endpoint uses AI to pull structured data from pages:
```bash
curl -X POST https://fastcrw.com/api/v1/extract \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/product"],
    "prompt": "Extract the product name, price, and whether it is in stock",
    "schema": {
      "type": "object",
      "properties": {
        "product_name": { "type": "string" },
        "price": { "type": "string" },
        "in_stock": { "type": "boolean" }
      }
    }
  }'
```
You describe what you want in plain English, optionally provide a JSON schema, and CRW returns structured data. No CSS selectors, no XPath, no regex.
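Because AI extraction can occasionally return malformed fields, it's worth checking the result against the schema you sent. A minimal, illustrative Python check (the response dict below is made up):

```python
# Verify that an extract response has the fields and types the schema asked for.
schema_types = {"product_name": str, "price": str, "in_stock": bool}

extracted = {"product_name": "Widget Pro", "price": "$49.00", "in_stock": True}

def validate(data, types):
    """Return a list of fields that are missing or have the wrong type."""
    return [k for k, t in types.items() if not isinstance(data.get(k), t)]

problems = validate(extracted, schema_types)
print(problems)  # an empty list means the response matches the schema
```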
Common Beginner Mistakes
1. Scraping Too Fast
Sending hundreds of requests per second will get your IP blocked and potentially cause issues for the website. Always add delays between requests. When using CRW's crawl endpoint, the rate limiting is handled for you.
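A fixed delay between requests is easy to add in plain Python. This sketch uses a stub fetch function and a shortened delay so it runs instantly; use 1–2 seconds against real sites:

```python
# Fetch a list of URLs with a pause between each request.
import time

def scrape_politely(urls, fetch, delay=1.0):
    """Call `fetch` for each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request after the first
        results.append(fetch(url))
    return results

# Offline demo with a stub fetch function and a tiny delay.
start = time.monotonic()
out = scrape_politely(["u1", "u2", "u3"], fetch=lambda u: f"scraped {u}", delay=0.05)
elapsed = time.monotonic() - start
print(out, f"{elapsed:.2f}s")
```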
2. Not Handling Errors
Websites go down, pages move, and structures change. Always check for errors in your scraping code:
```python
# Bad: assumes everything works
data = requests.post(url, json=body).json()
content = data["data"]["markdown"]

# Good: handles failures
response = requests.post(url, json=body, timeout=30)
data = response.json()
if data.get("success"):
    content = data["data"]["markdown"]
else:
    print(f"Failed: {data.get('error', 'Unknown error')}")
```
3. Ignoring robots.txt
Check https://example.com/robots.txt before scraping. It tells you which pages the site owner prefers you don't access.
4. Storing Raw HTML Instead of Clean Content
Raw HTML is full of navigation, scripts, ads, and boilerplate. For most use cases (AI, search, analysis), you want clean text. CRW's markdown output solves this — it strips non-content elements automatically.
5. Not Using an API When One Exists
Many websites offer official APIs for their data. Check for an API first — it's more reliable, structured, and usually legal. Scraping is for when there's no better option.
Why CRW for Beginners?
CRW simplifies web scraping to its essentials: give it a URL, get clean content back. Here's why it's ideal for beginners:
- No selectors to learn: You don't need to know CSS selectors, XPath, or DOM traversal. CRW extracts the content automatically.
- No browser setup: No installing Chrome, managing drivers, or configuring headless mode. CRW handles JavaScript rendering server-side.
- Clean output: Markdown is readable and useful immediately. No HTML parsing or text cleaning required.
- Simple API: One HTTP POST request. Works from any language, any platform.
- Open source: Run it locally for free, inspect the code, no vendor lock-in.
- Fast: 833ms average latency. CRW is built in Rust and uses only 6.6 MB of RAM idle.
What to Build Next
Now that you know the basics, here are some beginner-friendly projects to practice:
- News aggregator: Scrape 3–5 news sites daily, save articles as markdown files, build a simple reading list
- Price tracker: Monitor a product page, extract the price, log it to a CSV, and chart the price over time
- Documentation search: Crawl a documentation site, index the content, and build a simple search tool
- AI knowledge base: Scrape a topic you're studying, feed the markdown into a RAG pipeline, and chat with the content
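As a taste of the price-tracker project, here's a tiny Python sketch that appends prices to a CSV log and reads the history back (the dates and prices are made up):

```python
# Append each day's scraped price to a CSV file, then read the history back.
import csv
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / "prices.csv"

def log_price(path, date, price):
    """Append one (date, price) row to the CSV log."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date, price])

log_price(log_path, "2024-01-01", "19.99")
log_price(log_path, "2024-01-02", "17.49")

with open(log_path, newline="") as f:
    history = list(csv.reader(f))
print(history)  # [['2024-01-01', '19.99'], ['2024-01-02', '17.49']]
```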
Self-Hosted vs. Cloud
| | Self-Hosted CRW | fastCRW Cloud |
|---|---|---|
| Setup | docker run -p 3000:3000 ghcr.io/us/crw:latest | Sign up at fastcrw.com |
| Cost | Free | Pay per request |
| API Key | Not needed locally | Required |
| Best for | Learning, high volume | Quick start, no Docker needed |
Conclusion
Web scraping doesn't have to be complicated. With CRW, you can go from zero to extracting clean content from any website with a single API call. Start with the basics — scrape a page, see the markdown output, try crawling a small site — and build from there.
For more advanced topics, explore our Python web scraping guide, JavaScript scraping guide, or learn how to build a RAG pipeline with scraped content.
Ready to start? Self-host CRW for free or sign up for fastCRW cloud.