Tutorial

Web Scraping for Beginners: From Zero to Production (2026)

Beginner-friendly introduction to web scraping — what it is, how it works, legal considerations, tools overview, and hands-on examples with CRW's API.

March 29, 2026 · 18 min read

What Is Web Scraping?

Web scraping is the process of automatically extracting data from websites. Instead of manually copying and pasting information, you write code (or use a tool) that visits web pages, reads the content, and saves what you need in a structured format.

Think of it this way: when you visit a webpage, your browser downloads HTML, CSS, and JavaScript, then renders it into the page you see. Web scraping does the same thing — downloads the page — but instead of rendering it visually, it extracts the text, links, images, or data you're interested in.

What Can You Do with Web Scraping?

  • Research: Collect data for market research, academic studies, or competitive analysis
  • Price monitoring: Track product prices across e-commerce sites
  • Content aggregation: Build news feeds or knowledge bases from multiple sources
  • Lead generation: Gather business contact information from directories
  • AI and machine learning: Collect training data or build knowledge bases for RAG (Retrieval-Augmented Generation) systems
  • SEO analysis: Audit website content, meta tags, and link structures

How Web Scraping Works

Every web scraping process follows the same basic steps:

  1. Request: Send an HTTP request to a URL (just like your browser does when you type a URL)
  2. Receive: Get back the HTML content of the page
  3. Parse: Read through the HTML to find the data you want
  4. Extract: Pull out specific pieces of information (text, links, prices, etc.)
  5. Store: Save the data somewhere useful (spreadsheet, database, file)
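The five steps above can be sketched in a few lines of Python. This is a minimal, offline illustration using only the standard library: the "response" is a hard-coded HTML string standing in for steps 1–2 (in practice you would fetch it with urllib or requests), and the parser pulls out every link.

```python
# Steps 1-2 (request/receive) are simulated with a hard-coded HTML string
# so the example runs offline; a real scraper would fetch this over HTTP.
from html.parser import HTMLParser

SAMPLE_HTML = """<html><head><title>Example Domain</title></head>
<body><p>Some text.</p><a href="/about">About</a></body></html>"""

class LinkExtractor(HTMLParser):
    """Steps 3-4: parse the HTML and extract every link's href."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)   # steps 3-4: parse and extract
print(parser.links)        # step 5 would save this to a file or database
```

Real scrapers replace the hard-coded string with an HTTP fetch and the `print` with a write to a CSV file or database, but the shape of the pipeline stays the same.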

Static vs. Dynamic Websites

This is an important distinction for scraping:

  • Static websites: All content is in the HTML that the server sends. Simple to scrape — just download the HTML and parse it.
  • Dynamic websites (SPAs): Content is loaded by JavaScript after the page loads. Harder to scrape — you either need a browser engine that executes JavaScript, or you need a tool that handles this for you.

Most modern websites are at least partially dynamic. Social media feeds, e-commerce product pages, and dashboards almost always use JavaScript to load content. This is why simple HTTP requests often return empty or incomplete results — the JavaScript that populates the page hasn't run.
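You can see the problem concretely with a hypothetical single-page app. The HTML below is the kind of thing the server actually sends for a JavaScript-rendered page: an empty mount point and a script tag, with none of the visible content present yet.

```python
# Hypothetical server response for a JavaScript-rendered page: the markup
# contains only an empty mount point; the content arrives after app.js runs.
SPA_HTML = (
    '<html><body>'
    '<div id="root"></div>'             # JavaScript fills this in later
    '<script src="/app.js"></script>'
    '</body></html>'
)

# A plain HTTP fetch sees exactly this string, so searching it for the
# rendered content finds nothing:
print("$19.99" in SPA_HTML)  # False: the price only exists after JS runs
```

This is why a requests-only scraper returns "empty" pages for SPAs, and why you need either a browser engine or a rendering service for dynamic sites.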

Legal and Ethical Considerations

Before scraping any website, understand the rules:

What's Generally Acceptable

  • Scraping publicly available data (no login required)
  • Respecting the site's robots.txt file (tells crawlers which pages to avoid)
  • Scraping at a reasonable rate (don't overload the server)
  • Using the data for personal research, non-commercial purposes, or with permission

What to Be Careful About

  • Terms of Service: Many sites explicitly prohibit scraping. Violating ToS can have legal consequences.
  • Personal data: Scraping personal information (names, emails, photos) may violate privacy laws like GDPR or CCPA.
  • Copyright: The content on websites is typically copyrighted. Scraping for republication can be infringement.
  • Rate limiting: Sending too many requests can be considered a denial-of-service attack.

Best Practices

  • Always check the site's robots.txt (e.g., https://example.com/robots.txt)
  • Add delays between requests (1–2 seconds minimum)
  • Identify your scraper with a descriptive User-Agent string
  • Prefer official APIs when available — many sites offer APIs specifically for data access
  • If in doubt, ask permission
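Two of these practices, honoring robots.txt and pacing requests, can be sketched with Python's standard library. The rules are parsed from a hard-coded string here so the example runs offline; in practice you would read them from the site's actual robots.txt URL.

```python
# Sketch: check robots.txt rules and add a polite delay between requests.
import time
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([           # normally loaded from https://example.com/robots.txt
    "User-agent: *",
    "Disallow: /private/",
])

for path in ("/public-page", "/private/admin"):
    url = "https://example.com" + path
    if rules.can_fetch("MyScraper/1.0", url):
        print("OK to fetch:", url)
        time.sleep(1)   # polite delay before the next request
    else:
        print("Skipping (disallowed):", url)
```

The `MyScraper/1.0` user agent is a placeholder: replace it with a descriptive name that identifies your scraper, ideally with a contact URL or email.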

Web Scraping Tools Overview

There are many ways to scrape the web. Here's a quick overview of the main categories:

1. Code Libraries

Write code in your preferred language to fetch and parse web pages:

  • Python: requests + Beautiful Soup, Scrapy
  • JavaScript: Cheerio, Puppeteer, Playwright
  • Go: Colly

Best for developers who want full control. Requires maintaining selectors and handling edge cases manually.

2. Browser Automation

Control a real browser programmatically to handle JavaScript-heavy sites:

  • Puppeteer: Chrome/Chromium automation (JavaScript)
  • Playwright: Cross-browser automation (JavaScript, Python, Java, .NET)
  • Selenium: The veteran — supports many browsers and languages

Best for dynamic sites that require interaction (clicking buttons, scrolling, filling forms). Heavy and slow compared to HTTP-based approaches.

3. No-Code Tools

Visual interfaces for building scrapers without coding:

  • Make.com: Visual workflow automation with HTTP modules
  • n8n: Open-source workflow automation
  • Apify: Cloud scraping platform with pre-built actors

Best for non-developers or quick prototypes. Limited flexibility compared to code.

4. Scraping APIs

Send a URL, get clean data back — the API handles rendering, parsing, and cleaning:

  • CRW / fastCRW: Open-source, Firecrawl-compatible, returns clean markdown
  • Firecrawl: Commercial scraping API
  • ScrapingBee, ScraperAPI: Proxy-based scraping services

Best for developers who want clean output without managing scraping infrastructure. This is the approach we'll focus on in this guide.

Hands-On: Your First Scrape with CRW

Let's get hands-on. We'll use CRW — an open-source web scraping API that returns clean markdown from any URL. You can self-host it or use the cloud version.

Option A: Self-Host CRW (Free)

Run CRW locally with Docker:

docker run -p 3000:3000 ghcr.io/us/crw:latest

CRW is now running at http://localhost:3000. That's it — no configuration, no API key needed for local use.

Option B: Use fastCRW Cloud

Sign up at fastcrw.com and get an API key. No infrastructure to manage.

Scrape Your First Page

Let's scrape a page using curl (preinstalled on most modern operating systems):

# Self-hosted (no API key needed)
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

# fastCRW Cloud
curl -X POST https://fastcrw.com/api/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

The response contains clean markdown — all the navigation bars, cookie banners, ads, and footers stripped away:

{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com",
      "description": "..."
    }
  }
}

Scrape with Python

import requests

# Change to http://localhost:3000 for self-hosted
CRW_URL = "https://fastcrw.com/api"

response = requests.post(
    f"{CRW_URL}/v1/scrape",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer fc-YOUR_API_KEY",
    },
    json={
        "url": "https://example.com",
        "formats": ["markdown"],
    },
)

data = response.json()
if data["success"]:
    print(f"Title: {data['data']['metadata']['title']}")
    print(f"Content:\n{data['data']['markdown']}")
else:
    print(f"Error: {data.get('error')}")

Scrape with JavaScript

const response = await fetch("https://fastcrw.com/api/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer fc-YOUR_API_KEY",
  },
  body: JSON.stringify({
    url: "https://example.com",
    formats: ["markdown"],
  }),
});

const data = await response.json();
if (data.success) {
  console.log(`Title: ${data.data.metadata.title}`);
  console.log(`Content: ${data.data.markdown}`);
}

Going Further: Crawl an Entire Website

Scraping one page at a time is useful, but often you need content from an entire site. CRW's /v1/crawl endpoint handles this:

# Start a crawl (async — returns immediately with a job ID)
curl -X POST https://fastcrw.com/api/v1/crawl \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 20,
    "scrapeOptions": { "formats": ["markdown"] }
  }'

# Response: {"success": true, "id": "crawl-abc123"}

# Check the status (repeat until status is "completed")
curl https://fastcrw.com/api/v1/crawl/crawl-abc123 \
  -H "Authorization: Bearer fc-YOUR_API_KEY"

When the crawl completes, you get an array of all pages with their markdown content. CRW automatically discovers linked pages and scrapes them — you don't need to find URLs manually.
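Because the crawl is asynchronous, your code needs a polling loop. Here is a sketch of that pattern in Python. The status check is injected as a function so the example runs without a network call; in a real script, `get_status` would be an HTTP GET to the `/v1/crawl/<id>` endpoint, and the response shape shown is assumed for illustration.

```python
# Sketch of polling an async crawl job until it reports "completed".
import time

def wait_for_crawl(get_status, interval=2.0, max_attempts=30):
    """Poll get_status() until the job completes, then return its result."""
    for _ in range(max_attempts):
        status = get_status()
        if status.get("status") == "completed":
            return status
        time.sleep(interval)
    raise TimeoutError("crawl did not finish in time")

# Simulated job: pretends the crawl finishes on the third status check.
responses = iter([
    {"status": "scraping"},
    {"status": "scraping"},
    {"status": "completed", "data": [{"markdown": "# Page 1"}]},
])
result = wait_for_crawl(lambda: next(responses), interval=0.01)
print(len(result["data"]))  # number of pages crawled
```

Capping the attempts (rather than looping forever) means a stuck job surfaces as an error instead of hanging your script.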

Discover Pages Before Scraping

Sometimes you want to see what pages exist before deciding what to scrape. The /v1/map endpoint does exactly this:

curl -X POST https://fastcrw.com/api/v1/map \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{"url": "https://docs.example.com"}'

# Response: {"success": true, "links": ["https://docs.example.com/intro", "https://docs.example.com/api", ...]}

Map returns URLs without fetching content — much faster than a full crawl. Use it to preview a site's structure, filter to relevant pages, then scrape only what you need.
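The map-then-filter workflow is just list filtering. In this sketch, the links list stands in for a `/v1/map` response (hard-coded here so the example runs standalone), and a comprehension keeps only the API reference pages you'd then pass to the scrape endpoint.

```python
# Links as returned by /v1/map (hard-coded stand-in for the real response).
links = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api/auth",
    "https://docs.example.com/api/scrape",
    "https://docs.example.com/blog/announcement",
]

# Keep only the API reference pages before scraping them.
api_pages = [url for url in links if "/api/" in url]
print(api_pages)
```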

Extract Structured Data

Getting raw text is a great start, but sometimes you need structured data — product names, prices, dates, contact information. CRW's /v1/extract endpoint uses AI to pull structured data from pages:

curl -X POST https://fastcrw.com/api/v1/extract \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -d '{
    "urls": ["https://example.com/product"],
    "prompt": "Extract the product name, price, and whether it is in stock",
    "schema": {
      "type": "object",
      "properties": {
        "product_name": { "type": "string" },
        "price": { "type": "string" },
        "in_stock": { "type": "boolean" }
      }
    }
  }'

You describe what you want in plain English, optionally provide a JSON schema, and CRW returns structured data. No CSS selectors, no XPath, no regex.
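Because the extracted result is plain JSON shaped by your schema, it is worth a quick type sanity check before trusting it downstream. The `extracted` dict below is a hypothetical result, shown only to illustrate the check.

```python
# Hypothetical extraction result matching the schema from the request above.
extracted = {
    "product_name": "Widget Pro",
    "price": "$19.99",
    "in_stock": True,
}

# Verify each field has the type the schema promised before using it.
expected_types = {"product_name": str, "price": str, "in_stock": bool}
for field, expected in expected_types.items():
    value = extracted.get(field)
    assert isinstance(value, expected), f"{field}: expected {expected.__name__}"
print("extraction looks valid")
```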

Common Beginner Mistakes

1. Scraping Too Fast

Sending hundreds of requests per second will get your IP blocked and potentially cause issues for the website. Always add delays between requests. When using CRW's crawl endpoint, the rate limiting is handled for you.
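If you are making raw requests yourself, a small throttle class keeps the pace polite no matter how fast the surrounding loop runs. This is a generic sketch, not tied to any particular library:

```python
# A minimal throttle: guarantee at least min_interval seconds between requests.
import time

class Throttle:
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last = time.monotonic()

throttle = Throttle(0.2)  # at most one request every 0.2 seconds
start = time.monotonic()
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/c"]:
    throttle.wait()       # each real request would go right after this
elapsed = time.monotonic() - start
print(f"3 'requests' took {elapsed:.2f}s")
```

In real scraping you would use a longer interval (1–2 seconds, as noted above); 0.2 seconds here just keeps the demo quick.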

2. Not Handling Errors

Websites go down, pages move, and structures change. Always check for errors in your scraping code:

import requests

# (url and body are assumed to be defined as in the earlier examples)

# Bad: assumes everything works
data = requests.post(url, json=body).json()
content = data["data"]["markdown"]

# Good: handles failures
response = requests.post(url, json=body, timeout=30)
data = response.json()
if data.get("success"):
    content = data["data"]["markdown"]
else:
    print(f"Failed: {data.get('error', 'Unknown error')}")

3. Ignoring robots.txt

Check https://example.com/robots.txt before scraping. It tells you which pages the site owner prefers you don't access.

4. Storing Raw HTML Instead of Clean Content

Raw HTML is full of navigation, scripts, ads, and boilerplate. For most use cases (AI, search, analysis), you want clean text. CRW's markdown output solves this — it strips non-content elements automatically.

5. Not Using an API When One Exists

Many websites offer official APIs for their data. Check for an API first — it's more reliable, structured, and usually legal. Scraping is for when there's no better option.

Why CRW for Beginners?

CRW simplifies web scraping to its essentials: give it a URL, get clean content back. Here's why it's ideal for beginners:

  • No selectors to learn: You don't need to know CSS selectors, XPath, or DOM traversal. CRW extracts the content automatically.
  • No browser setup: No installing Chrome, managing drivers, or configuring headless mode. CRW handles JavaScript rendering server-side.
  • Clean output: Markdown is readable and useful immediately. No HTML parsing or text cleaning required.
  • Simple API: One HTTP POST request. Works from any language, any platform.
  • Open source: Run it locally for free, inspect the code, no vendor lock-in.
  • Fast: 833ms average latency. CRW is built in Rust and uses only 6.6 MB of RAM idle.

What to Build Next

Now that you know the basics, here are some beginner-friendly projects to practice:

  • News aggregator: Scrape 3–5 news sites daily, save articles as markdown files, build a simple reading list
  • Price tracker: Monitor a product page, extract the price, log it to a CSV, and chart the price over time
  • Documentation search: Crawl a documentation site, index the content, and build a simple search tool
  • AI knowledge base: Scrape a topic you're studying, feed the markdown into a RAG pipeline, and chat with the content
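The price tracker is a good first project because the storage half is tiny. Here is a sketch of the logging piece: append one (timestamp, price) row per run to a CSV file. The price is hard-coded so the example runs standalone; in the real tracker it would come from a scrape or extract call.

```python
# Append one (timestamp, price) row per run to a CSV price history.
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("price_history.csv")

def log_price(price: str) -> None:
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "price"])  # header on first run
        writer.writerow([datetime.now(timezone.utc).isoformat(), price])

log_price("$19.99")  # in the real tracker, the scraped price goes here
print(LOG_FILE.read_text().splitlines()[0])
```

Run it on a schedule (cron, Task Scheduler, or a GitHub Action) and the CSV becomes a price history you can chart.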

Self-Hosted vs. Cloud

  • Setup: Self-hosted runs with docker run -p 3000:3000 ghcr.io/us/crw:latest; the cloud version just needs a sign-up at fastcrw.com
  • Cost: Self-hosted is free; cloud is pay per request
  • API key: Not needed locally; required for cloud
  • Best for: Self-hosted suits learning and high volume; cloud suits a quick start with no Docker needed

Conclusion

Web scraping doesn't have to be complicated. With CRW, you can go from zero to extracting clean content from any website with a single API call. Start with the basics — scrape a page, see the markdown output, try crawling a small site — and build from there.

For more advanced topics, explore our Python web scraping guide, JavaScript scraping guide, or learn how to build a RAG pipeline with scraped content.

Ready to start? Self-host CRW for free or sign up for fastCRW cloud.

Get Started

Try CRW Free

Self-host for free (AGPL) or use fastCRW cloud with 500 free credits — no credit card required.