
Web Scraping for LLM Training Data

Use fastCRW to crawl domains into markdown, deduplicate, filter quality, and output JSONL for fine-tuning and RAG datasets.

Published: May 12, 2026
Updated: May 12, 2026
Category: use cases
Verdict

fastCRW excels at turning web content into fine-tuning and RAG datasets. Crawl a domain into clean markdown, deduplicate pages, filter for quality and relevance, and output JSONL—the standard format for OpenAI fine-tuning, HuggingFace, and LLM training pipelines. You get a dataset pipeline that deduplicates pages, filters out boilerplate and spam, and structures content for effective fine-tuning. The bottleneck shifts from crawling to quality filtering—fastCRW handles the first part at scale.

  • Crawl entire domains into clean markdown with automatic deduplication
  • Filter low-quality pages by length, keyword density, and relevance
  • Output JSONL for OpenAI fine-tuning, HuggingFace, or Anthropic APIs



Why LLM Training Needs Web Scraping

Base LLMs (GPT-4, Claude, Llama) are generalists trained on broad internet data. They often perform poorly on specialized domains:

  • Technical docs: Generic LLMs hallucinate API signatures and parameter names.
  • Legal writing: Domain jargon and precedent matter; base models miss nuances.
  • Medical information: Base models are too generic and too verbose, and can miss critical details.
  • Internal knowledge: Your company's processes, codebase, policies—not in training data.

Fine-tuning on domain-specific data teaches models to:

  • adopt your writing style and terminology,
  • follow your company's processes,
  • prioritize accuracy on domain tasks,
  • and reduce hallucinations in narrow domains.

Web scraping lets you build fine-tuning datasets at scale from public sources (documentation, tutorials, open-source code) or your own content (docs, blogs, internal wikis).


Where fastCRW Fits

Stage           | What Happens
----------------|-------------------------------------------------------
Data collection | Crawl entire domain into markdown
Cleaning        | Markdown output removes HTML boilerplate automatically
Deduplication   | Remove exact + near-duplicate pages
Filtering       | Quality heuristics (length, keyword relevance, noise)
Formatting      | Output JSONL for fine-tuning APIs

fastCRW handles the first two stages (collection and cleaning); your code handles deduplication, filtering, and formatting.


Architecture Overview

A typical LLM training dataset pipeline has six stages:

  1. Crawl: Fetch all pages from target domain(s) as markdown.
  2. Deduplicate: Remove exact duplicates (MD5 hashes) and near-duplicates (fuzzy text similarity).
  3. Filter: Remove short, noisy, or irrelevant pages using heuristics.
  4. Segment: Split long documents into chunks suitable for training.
  5. Structure: Format as JSONL with prompt/completion or instruction/input/output fields.
  6. Upload: Load into fine-tuning API or training framework.

fastCRW handles the crawl and cleaning; the code below handles deduplication through formatting.


Implementation Walkthrough

Here's a complete Python example that crawls a domain, deduplicates pages, filters for quality, and outputs JSONL ready for OpenAI fine-tuning.

Step 1: Install dependencies

uv venv
uv pip install requests python-dotenv

Step 2: Crawl domain and deduplicate

import json
import os
import requests
import hashlib
import re
from datetime import datetime
from difflib import SequenceMatcher

# Load API key
FASTCRW_API_KEY = os.getenv("FASTCRW_API_KEY")
FASTCRW_BASE_URL = "https://api.fastcrw.com/v1"

def crawl_domain(domain: str, max_depth: int = 3, max_pages: int = 1000) -> list[dict]:
    """
    Crawl an entire domain and return pages as markdown.
    
    Args:
        domain: Domain to crawl (e.g., https://docs.example.com)
        max_depth: Max crawl depth (3 = ~1000 pages for medium sites)
        max_pages: Max pages to crawl (safeguard against infinite crawls)
    
    Returns:
        List of crawled pages with markdown content
    """
    print(f"Crawling domain: {domain}")
    
    crawl_payload = {
        "url": domain,
        "maxDepth": max_depth,
        "maxPages": max_pages,
        "formats": ["markdown"],  # Clean markdown, no HTML
    }
    
    response = requests.post(
        f"{FASTCRW_BASE_URL}/crawl",
        headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
        json=crawl_payload,
        timeout=120
    )
    response.raise_for_status()
    
    crawl_result = response.json()
    
    # Extract pages from crawl result
    pages = []
    if "data" in crawl_result:
        for item in crawl_result["data"]:
            page = {
                "url": item.get("url"),
                "content": item.get("markdown", ""),
                "title": item.get("title", ""),
                "crawled_at": datetime.utcnow().isoformat()
            }
            pages.append(page)
    
    return pages

def compute_content_hash(content: str) -> str:
    """
    Compute MD5 hash of content for deduplication.
    
    Args:
        content: Text content
    
    Returns:
        MD5 hash hex string
    """
    return hashlib.md5(content.encode()).hexdigest()

def similarity(a: str, b: str) -> float:
    """
    Compute similarity between two strings (0-1).
    
    Args:
        a: First string
        b: Second string
    
    Returns:
        Similarity score (1 = identical, 0 = different)
    """
    return SequenceMatcher(None, a, b).ratio()

def deduplicate_pages(pages: list[dict], threshold: float = 0.90) -> list[dict]:
    """
    Remove duplicate and near-duplicate pages.
    
    Uses exact match (MD5) first, then fuzzy match (difflib's SequenceMatcher).
    Keeps first occurrence, removes later duplicates.
    
    Args:
        pages: List of page records
        threshold: Similarity threshold for near-duplicates (default 0.90)
    
    Returns:
        Deduplicated list of pages
    """
    seen_hashes = set()
    seen_contents = []
    unique_pages = []
    
    for page in pages:
        content = page.get("content", "").strip()
        
        if not content:
            continue
        
        # Check for exact duplicate (MD5 hash)
        content_hash = compute_content_hash(content)
        if content_hash in seen_hashes:
            continue
        
        # Check for near-duplicate (90%+ similarity)
        is_duplicate = False
        for seen_content in seen_contents:
            sim = similarity(content[:1000], seen_content[:1000])  # Compare first 1000 chars for speed
            if sim >= threshold:
                is_duplicate = True
                break
        
        if not is_duplicate:
            seen_hashes.add(content_hash)
            seen_contents.append(content)
            unique_pages.append(page)
    
    return unique_pages

def filter_quality(pages: list[dict]) -> list[dict]:
    """
    Filter pages by quality heuristics.
    
    Removes:
    - Pages <500 tokens (too short)
    - Pages >10K tokens (likely noise/listing pages)
    - Pages with <30% unique words (boilerplate-heavy)
    - Pages with low content density (<1 sentence per 50 words)
    
    Args:
        pages: List of page records
    
    Returns:
        Filtered list of quality pages
    """
    filtered = []
    
    for page in pages:
        content = page.get("content", "").strip()
        
        # Minimum length: ~500 tokens (1 token ≈ 4 chars, so 2000 chars)
        if len(content) < 2000:
            continue
        
        # Maximum length: ~10K tokens (1 token ≈ 4 chars, so 40,000 chars)
        if len(content) > 40000:
            continue
        
        # Calculate token estimate (rough: 1 token ≈ 4 chars)
        token_count = len(content) / 4
        
        # Uniqueness: count unique words
        words = re.findall(r'\b\w+\b', content.lower())
        unique_words = len(set(words))
        uniqueness = unique_words / len(words) if words else 0
        
        # Too much boilerplate (navigation, footers): <30% unique
        if uniqueness < 0.30:
            continue
        
        # Sentence count (estimate: sentences end with . ! ?)
        sentences = len(re.findall(r'[.!?]+', content))
        
        # Too sparse (navigation heavy): no sentences at all, or fewer
        # than 1 sentence per 50 words
        if sentences == 0 or len(words) / sentences > 50:
            continue
        
        # Passed all filters
        page["token_count"] = int(token_count)
        page["uniqueness"] = round(uniqueness, 2)
        filtered.append(page)
    
    return filtered

def chunk_long_content(pages: list[dict], max_tokens: int = 2000) -> list[dict]:
    """
    Split long pages into chunks for training.
    
    Preserves semantic breaks (paragraphs) when possible.
    
    Args:
        pages: List of page records
        max_tokens: Max tokens per chunk (~4 chars per token)
    
    Returns:
        Chunked pages
    """
    chunked = []
    max_chars = max_tokens * 4
    
    for page in pages:
        content = page.get("content", "")
        url = page.get("url", "")
        
        # If content is short enough, keep as-is
        if len(content) <= max_chars:
            chunked.append(page)
            continue
        
        # Split by paragraphs (double newlines)
        paragraphs = content.split("\n\n")
        
        current_chunk = ""
        chunk_num = 1
        
        for para in paragraphs:
            # If adding this paragraph exceeds max, save current chunk and start new
            if len(current_chunk) + len(para) > max_chars and current_chunk:
                chunk_page = {
                    "url": f"{url}#chunk_{chunk_num}",
                    "content": current_chunk.strip(),
                    "title": f"{page.get('title', '')} (Part {chunk_num})",
                    "crawled_at": page.get("crawled_at"),
                    "token_count": int(len(current_chunk) / 4)
                }
                chunked.append(chunk_page)
                
                current_chunk = ""
                chunk_num += 1
            
            current_chunk += para + "\n\n"
        
        # Add remaining chunk
        if current_chunk:
            chunk_page = {
                "url": f"{url}#chunk_{chunk_num}",
                "content": current_chunk.strip(),
                "title": f"{page.get('title', '')} (Part {chunk_num})" if chunk_num > 1 else page.get('title', ''),
                "crawled_at": page.get("crawled_at"),
                "token_count": int(len(current_chunk) / 4)
            }
            chunked.append(chunk_page)
    
    return chunked

def format_as_jsonl(pages: list[dict]) -> str:
    """
    Format pages as JSONL in the legacy prompt/completion format.
    
    Uses {"prompt": "...", "completion": "..."} objects. Note: current
    OpenAI fine-tuning expects the chat messages format instead (see
    format_as_instruction_jsonl below).
    For documentation/knowledge base, prompt is title+URL, completion is content.
    
    Args:
        pages: List of page records
    
    Returns:
        JSONL string (one JSON object per line)
    """
    jsonl_lines = []
    
    for page in pages:
        # Create a training example
        prompt = f"Title: {page.get('title', 'Untitled')}\nURL: {page.get('url', '')}\n\nContent:"
        completion = f"\n{page.get('content', '')}"
        
        example = {
            "prompt": prompt,
            "completion": completion
        }
        
        jsonl_lines.append(json.dumps(example))
    
    return "\n".join(jsonl_lines)

def format_as_instruction_jsonl(pages: list[dict]) -> str:
    """
    Format pages as instruction-following JSONL.
    
    Uses {"messages": [{"role": "user", "content": "..."}, ...]} format
    for Anthropic/OpenAI chat fine-tuning.
    
    Args:
        pages: List of page records
    
    Returns:
        JSONL string
    """
    jsonl_lines = []
    
    for page in pages:
        # Create an instruction-following example
        example = {
            "messages": [
                {
                    "role": "user",
                    "content": f"Explain: {page.get('title', 'topic')}"
                },
                {
                    "role": "assistant",
                    "content": page.get('content', '')
                }
            ]
        }
        
        jsonl_lines.append(json.dumps(example))
    
    return "\n".join(jsonl_lines)

# Example: Main training data pipeline
if __name__ == "__main__":
    print("LLM Training Data Scraping Pipeline")
    print("-" * 50)
    
    # Step 1: Crawl domain
    domain = "https://docs.fastcrw.com"  # Example: crawl fastCRW docs
    pages = crawl_domain(domain, max_depth=3, max_pages=500)
    
    print(f"\nCrawled {len(pages)} pages")
    
    # Step 2: Deduplicate
    unique_pages = deduplicate_pages(pages, threshold=0.90)
    print(f"After dedup: {len(unique_pages)} unique pages")
    
    # Step 3: Filter for quality
    quality_pages = filter_quality(unique_pages)
    print(f"After quality filter: {len(quality_pages)} pages")
    
    # Show statistics
    tokens = [p.get("token_count", 0) for p in quality_pages]
    total_tokens = sum(tokens)
    print(f"Total tokens in dataset: {total_tokens:,} (~{int(total_tokens / 1000)}K)")
    
    # Step 4: Chunk long content
    chunked_pages = chunk_long_content(quality_pages, max_tokens=2000)
    print(f"After chunking: {len(chunked_pages)} chunks")
    
    # Step 5: Format as JSONL
    jsonl_output = format_as_jsonl(chunked_pages)
    
    # Save to file
    with open("training_data.jsonl", "w") as f:
        f.write(jsonl_output)
    
    print(f"\nSaved to training_data.jsonl")
    
    # Example: Alternative instruction format for chat fine-tuning
    instruction_jsonl = format_as_instruction_jsonl(chunked_pages)
    with open("training_data_instruction.jsonl", "w") as f:
        f.write(instruction_jsonl)
    
    print(f"Also saved instruction format to training_data_instruction.jsonl")
    
    # Show first example
    if chunked_pages:
        print("\nExample training data point:")
        example = json.loads(jsonl_output.split("\n")[0])
        print(f"Prompt length: {len(example['prompt'])} chars")
        print(f"Completion length: {len(example['completion'])} chars")

Step 3: Run the pipeline

export FASTCRW_API_KEY="your_api_key_here"
uv run python scrape_training_data.py

This creates training_data.jsonl (legacy prompt/completion format) and training_data_instruction.jsonl (chat messages format) ready to upload.

Step 4: Upload to OpenAI fine-tuning

The legacy fine_tunes CLI is deprecated, and current OpenAI fine-tuning expects the chat messages format, so upload training_data_instruction.jsonl with the Python SDK (the model name here is an example; check OpenAI's current list of fine-tunable models):

from openai import OpenAI

client = OpenAI()
upload = client.files.create(
    file=open("training_data_instruction.jsonl", "rb"),
    purpose="fine-tune"
)
client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-4o-mini")

Or use HuggingFace Datasets:

import json

from datasets import Dataset

data = [json.loads(line) for line in open("training_data.jsonl")]
dataset = Dataset.from_dict({
    "text": [item["prompt"] + item["completion"] for item in data]
})
dataset.push_to_hub("your-username/domain-dataset")

Production Considerations

Deduplication at Scale

For very large crawls (100K+ pages), exact hash matching is fast, but pairwise fuzzy matching is O(n²). Use approximate matching:

  • MinHash: Fast approximate deduplication for very large corpora.
  • Bloom filters: Space-efficient set membership for hashes.
  • Locality-sensitive hashing: Groups similar content without comparing all pairs.

For datasets under 100K pages, the Python code above is sufficient.
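
The O(n²) pairwise comparison can be replaced with MinHash signatures: each document is reduced to k integers, and the fraction of matching positions between two signatures estimates their Jaccard similarity. A minimal stdlib sketch (function names are illustrative; a library like datasketch provides a production version with LSH indexing):

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping word k-grams; the unit of comparison for Jaccard similarity."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """One minimum value per salted hash function over the shingle set."""
    sig = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        sig.append(min(
            int.from_bytes(hashlib.md5(salt + s.encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Each document is hashed once (O(n) total), and near-duplicate candidates are found by comparing short signatures instead of full texts.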

Quality Filtering

The heuristics above (length, uniqueness, sparsity) are good starting points, but domain-specific filtering is better:

  • Technical docs: Favor pages with code examples.
  • Legal docs: Favor pages with specific section headers.
  • Blog content: Favor pages with dates, author info, detailed explanations.

Sample your filtered dataset manually to calibrate filters.
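
As one concrete example of domain-specific filtering for technical docs, counting fenced code blocks in the markdown is a cheap proxy for "contains worked examples." A sketch (helper names are hypothetical, not part of the pipeline above):

```python
import re

def has_code_examples(markdown: str, min_blocks: int = 1) -> bool:
    """True if the page contains at least min_blocks fenced code blocks."""
    # Each fenced block needs an opening and a closing ``` line.
    fences = len(re.findall(r"^```", markdown, flags=re.MULTILINE))
    return fences // 2 >= min_blocks

def prefer_code_pages(pages: list[dict]) -> list[dict]:
    """Keep pages with code examples; fall back to all pages if none qualify."""
    with_code = [p for p in pages if has_code_examples(p.get("content", ""))]
    return with_code or pages
```

Because fastCRW emits markdown, the fences survive extraction and this check stays a one-liner.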

Chunking Strategy

Long documents (>2K tokens) need splitting. Break at semantic boundaries when possible:

  • Documentation: Split by section headers (## Level 2 headings)
  • Articles: Split by paragraphs or sentences
  • Code: Split by function definitions
  • Books: Split by chapters

fastCRW's markdown output preserves structure, making semantic chunking easier.
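
Splitting documentation at level-2 headings can be sketched with a lookahead regex that keeps each heading attached to the section it opens (a hypothetical helper, not part of the walkthrough code):

```python
import re

def split_by_h2(markdown: str) -> list[str]:
    """Split markdown into sections at level-2 headings.

    Text before the first ## heading becomes its own section; each
    heading stays attached to the body that follows it.
    """
    # A zero-width lookahead split keeps "## " at the start of each section.
    parts = re.split(r"(?m)^(?=## )", markdown)
    return [p.strip() for p in parts if p.strip()]
```

Sections that still exceed the token budget can then fall through to the paragraph-based chunker above.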

Fine-tuning Evaluation

Always measure improvement:

  1. Create a held-out test set (10-20% of data).
  2. Fine-tune a model on the training set.
  3. Evaluate on domain-specific tasks: accuracy, relevance, style match.
  4. Compare to baseline (non-fine-tuned model).
  5. A/B test with real users if possible.

Small high-quality datasets (100–1,000 examples) often beat large mediocre ones (100K examples).
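
Step 1 above, the held-out split, is worth seeding so the test set stays stable across re-crawls and retraining runs. A minimal sketch:

```python
import random

def train_test_split(examples: list, test_fraction: float = 0.15, seed: int = 42):
    """Shuffle with a fixed seed and carve off a held-out test set.

    A fixed seed makes the split reproducible, so the same examples
    stay in the test set and never leak into later training runs.
    """
    shuffled = examples[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Persist the test set to disk after the first split; re-deriving it from a changed crawl silently invalidates comparisons.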


Licensing and Copyright

Only scrape content you own or have permission to use:

  • Your own content: Blogs, docs, wikis—OK to scrape.
  • Open-source projects: Documentation with open licenses (MIT, Apache)—OK to scrape.
  • Creative Commons content: Only if your use respects the license (e.g., CC-BY requires attribution).
  • Paywalled content: Books, news behind paywalls—not OK.
  • Copyrighted material: Articles, tutorials without explicit permission—risky.

Check the license before scraping. When in doubt, contact the author or use an official API.

Ethical Training Data

High-quality training data should:

  • Represent diverse perspectives: Avoid narrow sources that skew your model.
  • Exclude toxic content: Remove hate speech, spam, misinformation.
  • Respect privacy: Don't scrape personal data or private communications.
  • Attribute sources: Document where your training data comes from.

Training on biased data produces biased models. Audit your dataset.


FAQ

Q: What's the minimum dataset size for meaningful fine-tuning?

A: Depends on model size and task. For small models (Llama 2 7B), 100 high-quality examples can improve performance. For large models (GPT-4), 50–100 examples are often sufficient. Quality matters far more than quantity. Experiment with your domain.

Q: How do I deduplicate pages that use templates but have different data?

A: Templated pages (product listings, search results) have identical structure but different content. Use content-based deduplication (Levenshtein distance) rather than structure. fastCRW outputs markdown, which makes this easier—you're comparing text, not HTML.

Q: Can I fine-tune on multiple domains?

A: Yes. Combine JSONL files from different domains. But be aware: multi-domain fine-tuning may cause the model to lose specificity. Train separate models per domain for best results, or use RAG (Retrieval-Augmented Generation) with separate knowledge bases.

Q: What about updating my training data over time?

A: Re-crawl regularly (monthly or quarterly) and retrain the model. Or use incremental fine-tuning (start from previous fine-tuned model, not base model). This keeps your model current without expensive retraining from scratch.

Q: How do I handle pages that are mostly code?

A: Code-heavy pages are valuable for code generation and technical Q&A. Keep them. Use keyword filtering to remove documentation spam (autogenerated API docs with no explanation), but preserve actual tutorials and code examples.

Q: Should I filter for English-only content?

A: Depends on your use case. If you're fine-tuning a model for English only, filtering for language is good. Use a language detection library (langdetect, FastText) to identify non-English pages and exclude them.
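
If you want a dependency-free first pass before reaching for langdetect or FastText, a crude stopword-ratio heuristic catches obviously non-English pages (a sketch only; real language detection is far more accurate):

```python
# Common English function words; pages with almost none are likely non-English.
ENGLISH_STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
    "for", "on", "with", "as", "that", "this", "it", "be", "by",
}

def looks_english(text: str, min_ratio: float = 0.10) -> bool:
    """True if at least min_ratio of the words are common English stopwords."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= min_ratio
```
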

Q: How do I benchmark my fine-tuned model?

A: Create a domain-specific benchmark task (e.g., "answer 50 common questions in your domain"). Evaluate both base and fine-tuned models. Compare accuracy, relevance, response time, and cost. Well-executed fine-tuning should show a clear, measurable improvement; the size of the gain varies by domain and task, so benchmark rather than assume.
