Web Scraping for LLM Training Data
Use fastCRW to crawl domains into markdown, deduplicate, filter quality, and output JSONL for fine-tuning and RAG datasets.
fastCRW excels at turning web content into fine-tuning and RAG datasets. Crawl a domain into clean markdown, deduplicate pages, filter for quality and relevance, and output JSONL, the standard format for HuggingFace, OpenAI fine-tuning, and LLM training pipelines. The result is a dataset pipeline that deduplicates pages, strips boilerplate and spam, and structures content for effective fine-tuning. The bottleneck shifts from crawling to quality filtering, and fastCRW handles the crawling at scale.
Verdict
fastCRW is the fastest path from web content to LLM training data. Crawl a domain into clean markdown, deduplicate and filter the pages, then output JSONL ready for OpenAI, HuggingFace, or Anthropic workflows. You get a repeatable dataset pipeline that scales from 100 to 1M+ pages. The hard part isn't scraping, it's quality filtering: fastCRW handles the scraping so you can focus on filtering, training, and evaluation. Fine-tuning on scraped domain data can produce models that clearly outperform base models on specialized topics.
Why LLM Training Needs Web Scraping
Base LLMs (GPT-4, Claude, Llama) are generalists trained on broad internet data. They often perform poorly on specialized domains:
- Technical docs: Generic LLMs hallucinate API signatures and parameter names.
- Legal writing: Domain jargon and precedent matter; base models miss nuances.
- Medical information: Base models are too generic, too verbose, and prone to missing critical details.
- Internal knowledge: Your company's processes, codebase, policies—not in training data.
Fine-tuning on domain-specific data teaches models to:
- adopt your writing style and terminology,
- follow your company's processes,
- prioritize accuracy on domain tasks,
- and reduce hallucinations in narrow domains.
Web scraping lets you build fine-tuning datasets at scale from public sources (documentation, tutorials, open-source code) or your own content (docs, blogs, internal wikis).
Where fastCRW Fits
| Stage | What it involves |
|---|---|
| Data collection | crawl entire domain into markdown |
| Cleaning | Markdown output removes HTML boilerplate automatically |
| Deduplication | Remove exact + near-duplicate pages |
| Filtering | Quality heuristics (length, keyword relevance, noise) |
| Formatting | Output JSONL for fine-tuning APIs |
fastCRW handles the first two stages; your code handles deduplication, filtering, and formatting.
Architecture Overview
A typical LLM training dataset pipeline has six stages:
- Crawl: Fetch all pages from target domain(s) as markdown.
- Deduplicate: Remove exact duplicates (MD5 hash) and near-duplicates (fuzzy text similarity).
- Filter: Remove short, noisy, or irrelevant pages using heuristics.
- Segment: Split long documents into chunks suitable for training.
- Structure: Format as JSONL with prompt/completion or instruction/input/output fields.
- Upload: Load into fine-tuning API or training framework.
fastCRW handles the crawl and markdown cleanup; your code handles deduplication through upload.
Implementation Walkthrough
Here's a complete Python example that crawls a domain, deduplicates pages, filters for quality, and outputs JSONL ready for OpenAI fine-tuning.
Step 1: Install dependencies
uv venv
uv pip install requests python-dotenv
Step 2: Crawl domain and deduplicate
import hashlib
import json
import os
import re
from datetime import datetime
from difflib import SequenceMatcher

import requests
from dotenv import load_dotenv

# Load API key from .env (python-dotenv is installed in Step 1)
load_dotenv()
FASTCRW_API_KEY = os.getenv("FASTCRW_API_KEY")
FASTCRW_BASE_URL = "https://api.fastcrw.com/v1"
def crawl_domain(domain: str, max_depth: int = 3, max_pages: int = 1000) -> list[dict]:
"""
Crawl an entire domain and return pages as markdown.
Args:
domain: Domain to crawl (e.g., https://docs.example.com)
max_depth: Max crawl depth (3 = ~1000 pages for medium sites)
max_pages: Max pages to crawl (safeguard against infinite crawls)
Returns:
List of crawled pages with markdown content
"""
print(f"Crawling domain: {domain}")
crawl_payload = {
"url": domain,
"maxDepth": max_depth,
"maxPages": max_pages,
"formats": ["markdown"], # Clean markdown, no HTML
}
response = requests.post(
f"{FASTCRW_BASE_URL}/crawl",
headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
json=crawl_payload,
timeout=120
)
response.raise_for_status()
crawl_result = response.json()
# Extract pages from crawl result
pages = []
if "data" in crawl_result:
for item in crawl_result["data"]:
page = {
"url": item.get("url"),
"content": item.get("markdown", ""),
"title": item.get("title", ""),
"crawled_at": datetime.utcnow().isoformat()
}
pages.append(page)
return pages
def compute_content_hash(content: str) -> str:
"""
Compute MD5 hash of content for deduplication.
Args:
content: Text content
Returns:
MD5 hash hex string
"""
return hashlib.md5(content.encode()).hexdigest()
def similarity(a: str, b: str) -> float:
"""
Compute similarity between two strings (0-1).
Args:
a: First string
b: Second string
Returns:
Similarity score (1 = identical, 0 = different)
"""
return SequenceMatcher(None, a, b).ratio()
def deduplicate_pages(pages: list[dict], threshold: float = 0.90) -> list[dict]:
"""
Remove duplicate and near-duplicate pages.
Uses exact match (MD5) first, then fuzzy match (difflib SequenceMatcher).
Keeps first occurrence, removes later duplicates.
Args:
pages: List of page records
threshold: Similarity threshold for near-duplicates (default 0.90)
Returns:
Deduplicated list of pages
"""
seen_hashes = set()
seen_contents = []
unique_pages = []
for page in pages:
content = page.get("content", "").strip()
if not content:
continue
# Check for exact duplicate (MD5 hash)
content_hash = compute_content_hash(content)
if content_hash in seen_hashes:
continue
# Check for near-duplicate (90%+ similarity)
is_duplicate = False
for seen_content in seen_contents:
sim = similarity(content[:1000], seen_content[:1000]) # Compare first 1000 chars for speed
if sim >= threshold:
is_duplicate = True
break
if not is_duplicate:
seen_hashes.add(content_hash)
seen_contents.append(content)
unique_pages.append(page)
return unique_pages
def filter_quality(pages: list[dict]) -> list[dict]:
"""
Filter pages by quality heuristics.
Removes:
- Pages <500 tokens (too short)
- Pages >10K tokens (likely noise/listing pages)
- Pages with <30% unique words (boilerplate-heavy)
- Pages with low content density (<1 sentence per 50 words)
Args:
pages: List of page records
Returns:
Filtered list of quality pages
"""
filtered = []
for page in pages:
content = page.get("content", "").strip()
# Minimum length: ~500 tokens (~2,000 chars at ~4 chars/token)
if len(content) < 2000:
continue
# Maximum length: ~10K tokens (~40,000 chars)
if len(content) > 40000:
continue
# Calculate token estimate (rough: 1 token ≈ 4 chars)
token_count = len(content) / 4
# Uniqueness: count unique words
words = re.findall(r'\b\w+\b', content.lower())
unique_words = len(set(words))
uniqueness = unique_words / len(words) if words else 0
# Too much boilerplate (navigation, footers): <30% unique
if uniqueness < 0.30:
continue
# Sentence count (estimate: sentences end with . ! ?)
sentences = len(re.findall(r'[.!?]+', content))
# Too sparse (navigation heavy): fewer than 1 sentence per 50 words
if sentences > 0 and len(words) / sentences > 50:
continue
# Passed all filters
page["token_count"] = int(token_count)
page["uniqueness"] = round(uniqueness, 2)
filtered.append(page)
return filtered
def chunk_long_content(pages: list[dict], max_tokens: int = 2000) -> list[dict]:
"""
Split long pages into chunks for training.
Preserves semantic breaks (paragraphs) when possible.
Args:
pages: List of page records
max_tokens: Max tokens per chunk (~4 chars per token)
Returns:
Chunked pages
"""
chunked = []
max_chars = max_tokens * 4
for page in pages:
content = page.get("content", "")
url = page.get("url", "")
# If content is short enough, keep as-is
if len(content) <= max_chars:
chunked.append(page)
continue
# Split by paragraphs (double newlines)
paragraphs = content.split("\n\n")
current_chunk = ""
chunk_num = 1
for para in paragraphs:
# If adding this paragraph exceeds max, save current chunk and start new
if len(current_chunk) + len(para) > max_chars and current_chunk:
chunk_page = {
"url": f"{url}#chunk_{chunk_num}",
"content": current_chunk.strip(),
"title": f"{page.get('title', '')} (Part {chunk_num})",
"crawled_at": page.get("crawled_at"),
"token_count": int(len(current_chunk) / 4)
}
chunked.append(chunk_page)
current_chunk = ""
chunk_num += 1
current_chunk += para + "\n\n"
# Add remaining chunk
if current_chunk:
chunk_page = {
"url": f"{url}#chunk_{chunk_num}",
"content": current_chunk.strip(),
"title": f"{page.get('title', '')} (Part {chunk_num})" if chunk_num > 1 else page.get('title', ''),
"crawled_at": page.get("crawled_at"),
"token_count": int(len(current_chunk) / 4)
}
chunked.append(chunk_page)
return chunked
def format_as_jsonl(pages: list[dict]) -> str:
"""
Format pages as prompt/completion JSONL (the legacy OpenAI completions format, also convenient for HuggingFace text datasets).
Uses {"prompt": "...", "completion": "..."} format.
For documentation/knowledge base content, the prompt is title+URL and the completion is the page content.
Args:
pages: List of page records
Returns:
JSONL string (one JSON object per line)
"""
jsonl_lines = []
for page in pages:
# Create a training example
prompt = f"Title: {page.get('title', 'Untitled')}\nURL: {page.get('url', '')}\n\nContent:"
completion = f"\n{page.get('content', '')}"
example = {
"prompt": prompt,
"completion": completion
}
jsonl_lines.append(json.dumps(example))
return "\n".join(jsonl_lines)
def format_as_instruction_jsonl(pages: list[dict]) -> str:
"""
Format pages as instruction-following JSONL.
Uses {"messages": [{"role": "user", "content": "..."}, ...]} format
for Anthropic/OpenAI chat fine-tuning.
Args:
pages: List of page records
Returns:
JSONL string
"""
jsonl_lines = []
for page in pages:
# Create an instruction-following example
example = {
"messages": [
{
"role": "user",
"content": f"Explain: {page.get('title', 'topic')}"
},
{
"role": "assistant",
"content": page.get('content', '')
}
]
}
jsonl_lines.append(json.dumps(example))
return "\n".join(jsonl_lines)
# Example: Main training data pipeline
if __name__ == "__main__":
print("LLM Training Data Scraping Pipeline")
print("-" * 50)
# Step 1: Crawl domain
domain = "https://docs.fastcrw.com" # Example: crawl fastCRW docs
pages = crawl_domain(domain, max_depth=3, max_pages=500)
print(f"\nCrawled {len(pages)} pages")
# Step 2: Deduplicate
unique_pages = deduplicate_pages(pages, threshold=0.90)
print(f"After dedup: {len(unique_pages)} unique pages")
# Step 3: Filter for quality
quality_pages = filter_quality(unique_pages)
print(f"After quality filter: {len(quality_pages)} pages")
# Show statistics
tokens = [p.get("token_count", 0) for p in quality_pages]
total_tokens = sum(tokens)
print(f"Total tokens in dataset: {total_tokens:,} (~{int(total_tokens / 1000)}K)")
# Step 4: Chunk long content
chunked_pages = chunk_long_content(quality_pages, max_tokens=2000)
print(f"After chunking: {len(chunked_pages)} chunks")
# Step 5: Format as JSONL
jsonl_output = format_as_jsonl(chunked_pages)
# Save to file
with open("training_data.jsonl", "w") as f:
f.write(jsonl_output)
print(f"\nSaved to training_data.jsonl")
# Example: Alternative instruction format for chat fine-tuning
instruction_jsonl = format_as_instruction_jsonl(chunked_pages)
with open("training_data_instruction.jsonl", "w") as f:
f.write(instruction_jsonl)
print(f"Also saved instruction format to training_data_instruction.jsonl")
# Show first example
if chunked_pages:
print("\nExample training data point:")
example = json.loads(jsonl_output.split("\n")[0])
print(f"Prompt length: {len(example['prompt'])} chars")
print(f"Completion length: {len(example['completion'])} chars")
Step 3: Run the pipeline
export FASTCRW_API_KEY="your_api_key_here"
uv run python scrape_training_data.py
This creates training_data.jsonl (prompt/completion format) and training_data_instruction.jsonl (chat messages format), ready to upload.
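Before uploading, it's worth a quick sanity check that every line parses and carries the expected fields. A minimal validator sketch (the key names match the formats produced above; the helper itself is illustrative, not part of fastCRW):

import json

def validate_jsonl(path: str, required_keys: set[str]) -> int:
    """Check that every line parses as JSON and contains the expected keys."""
    count = 0
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError if a line is malformed
            missing = required_keys - record.keys()
            if missing:
                raise ValueError(f"Line {line_no} is missing keys: {missing}")
            count += 1
    return count

# Example: print(validate_jsonl("training_data.jsonl", {"prompt", "completion"}))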
Step 4: Upload to OpenAI fine-tuning
Current OpenAI fine-tuning for chat models (gpt-3.5-turbo and newer) expects the chat messages format, so upload training_data_instruction.jsonl; the legacy fine_tunes.create CLI command and prompt/completion format applied only to older completion models. A minimal sketch with the official openai Python SDK (v1.x):
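from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the chat-format dataset produced by the pipeline above
training_file = client.files.create(
    file=open("training_data_instruction.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on the uploaded file
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)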
Or use HuggingFace Datasets:
import json
from datasets import Dataset

with open("training_data.jsonl") as f:
    data = [json.loads(line) for line in f]

dataset = Dataset.from_dict({
    "text": [item["prompt"] + item["completion"] for item in data]
})
dataset.push_to_hub("your-username/domain-dataset")
Production Considerations
Deduplication at Scale
For very large crawls (100K+ pages), exact hash matching is fast, but pairwise fuzzy matching is O(n²). Use approximate matching:
- MinHash: Fast approximate deduplication for very large corpora.
- Bloom filters: Space-efficient set membership for hashes.
- Locality-sensitive hashing: Groups similar content without comparing all pairs.
For datasets under 100K pages, the Python code above is sufficient.
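For example, the datasketch library (a third-party package, used here purely as an illustration) implements MinHash and LSH; a minimal sketch of approximate deduplication under those assumptions:

from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from 3-word shingles."""
    sig = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(len(words) - 2, 1)):
        shingle = " ".join(words[i:i + 3])
        sig.update(shingle.encode("utf8"))
    return sig

def deduplicate_minhash(pages: list[dict], threshold: float = 0.90) -> list[dict]:
    """Approximate near-duplicate removal without comparing every pair of pages."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_pages = []
    for i, page in enumerate(pages):
        sig = minhash_signature(page.get("content", ""))
        if lsh.query(sig):  # an already-kept page is similar enough
            continue
        lsh.insert(f"page_{i}", sig)
        unique_pages.append(page)
    return unique_pages

Because MinHashLSH indexes each signature once, lookup cost stays roughly constant per page instead of growing with the corpus.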
Quality Filtering
The heuristics above (length, uniqueness, sparsity) are good starting points, but domain-specific filtering is better:
- Technical docs: Favor pages with code examples.
- Legal docs: Favor pages with specific section headers.
- Blog content: Favor pages with dates, author info, detailed explanations.
Sample your filtered dataset manually to calibrate filters.
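For instance, a technical-docs filter might keep only pages that contain at least one fenced code block; a rough sketch (the threshold and helper name are illustrative):

import re

def has_code_examples(page: dict, min_blocks: int = 1) -> bool:
    """Keep technical-docs pages that include at least one fenced code block."""
    content = page.get("content", "")
    fenced_blocks = re.findall(r"```[\s\S]*?```", content)
    return len(fenced_blocks) >= min_blocks

# Example: code_pages = [p for p in quality_pages if has_code_examples(p)]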
Chunking Strategy
Long documents (>2K tokens) need splitting. Break at semantic boundaries when possible:
- Documentation: Split by section headers (## Level 2 headings)
- Articles: Split by paragraphs or sentences
- Code: Split by function definitions
- Books: Split by chapters
fastCRW's markdown output preserves structure, making semantic chunking easier.
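A minimal sketch of heading-based chunking for markdown output, as an alternative to the paragraph-based chunk_long_content above (splitting on level-2 headings is an assumption that fits typical documentation):

import re

def chunk_by_headings(page: dict) -> list[dict]:
    """Split a markdown page at level-2 headings, keeping each heading with its section."""
    content = page.get("content", "")
    sections = re.split(r"\n(?=## )", content)
    chunks = []
    for i, section in enumerate(sections, start=1):
        section = section.strip()
        if not section:
            continue
        chunks.append({
            "url": f"{page.get('url', '')}#section_{i}",
            "title": page.get("title", ""),
            "content": section,
            "token_count": int(len(section) / 4),  # same ~4 chars/token estimate as above
        })
    return chunks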
Fine-tuning Evaluation
Always measure improvement:
- Create a held-out test set (10-20% of data).
- Fine-tune a model on the training set.
- Evaluate on domain-specific tasks: accuracy, relevance, style match.
- Compare to baseline (non-fine-tuned model).
- A/B test with real users if possible.
Small high-quality datasets (100–1,000 examples) often beat large mediocre ones (100K examples).
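A minimal sketch of the held-out split (the 15% fraction and fixed seed are illustrative choices):

import random

def train_test_split(examples: list[dict], test_fraction: float = 0.15, seed: int = 42):
    """Hold out a fraction of examples for evaluation before fine-tuning."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Example: train_pages, test_pages = train_test_split(chunked_pages)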
Legal and Ethical Notes
Copyright and Licensing
Only scrape content you own or have permission to use:
- Your own content: Blogs, docs, wikis—OK to scrape.
- Open-source projects: Documentation with open licenses (MIT, Apache)—OK to scrape.
- Creative Commons content: Only if your use respects the license (e.g., CC-BY requires attribution).
- Paywalled content: Books, news behind paywalls—not OK.
- Copyrighted material: Articles, tutorials without explicit permission—risky.
Check the license before scraping. When in doubt, contact the author or use an official API.
Ethical Training Data
High-quality training data should:
- Represent diverse perspectives: Avoid narrow sources that skew your model.
- Exclude toxic content: Remove hate speech, spam, misinformation.
- Respect privacy: Don't scrape personal data or private communications.
- Attribute sources: Document where your training data comes from.
Training on biased data produces biased models. Audit your dataset.
FAQ
Q: What's the minimum dataset size for meaningful fine-tuning?
A: Depends on model size and task. For small models (Llama 2 7B), 100 high-quality examples can improve performance. For large models (GPT-4), 50–100 examples are often sufficient. Quality matters far more than quantity. Experiment with your domain.
Q: How do I deduplicate pages that use templates but have different data?
A: Templated pages (product listings, search results) have identical structure but different content. Use content-based deduplication (Levenshtein distance) rather than structure. fastCRW outputs markdown, which makes this easier—you're comparing text, not HTML.
Q: Can I fine-tune on multiple domains?
A: Yes. Combine JSONL files from different domains. But be aware: multi-domain fine-tuning may cause the model to lose specificity. Train separate models per domain for best results, or use RAG (Retrieval-Augmented Generation) with separate knowledge bases.
Q: What about updating my training data over time?
A: Re-crawl regularly (monthly or quarterly) and retrain the model. Or use incremental fine-tuning (start from previous fine-tuned model, not base model). This keeps your model current without expensive retraining from scratch.
Q: How do I handle pages that are mostly code?
A: Code-heavy pages are valuable for code generation and technical Q&A. Keep them. Use keyword filtering to remove documentation spam (autogenerated API docs with no explanation), but preserve actual tutorials and code examples.
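One way to approximate that, using a prose-to-code ratio heuristic instead of keyword lists, is sketched below (the 20% threshold and helper name are arbitrary):

import re

def is_autogenerated_reference(page: dict, min_prose_ratio: float = 0.20) -> bool:
    """Flag pages that are almost entirely code or signatures with little explanatory prose."""
    content = page.get("content", "")
    if not content:
        return True
    code_chars = sum(len(block) for block in re.findall(r"```[\s\S]*?```", content))
    prose_ratio = (len(content) - code_chars) / len(content)
    return prose_ratio < min_prose_ratio

# Example: tutorial_pages = [p for p in quality_pages if not is_autogenerated_reference(p)]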
Q: Should I filter for English-only content?
A: Depends on your use case. If you're fine-tuning a model for English only, filtering for language is good. Use a language detection library (langdetect, FastText) to identify non-English pages and exclude them.
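A minimal sketch using langdetect (one of the libraries mentioned above; the 2,000-character sample size is an arbitrary choice):

from langdetect import detect

def is_english(page: dict) -> bool:
    """Return True if the page content is detected as English."""
    sample = page.get("content", "")[:2000]  # a sample is enough for detection
    try:
        return detect(sample) == "en"
    except Exception:
        return False  # langdetect raises on very short or ambiguous text

# Example: english_pages = [p for p in quality_pages if is_english(p)]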
Q: How do I benchmark my fine-tuned model?
A: Create a domain-specific benchmark task (e.g., "answer 50 common questions in your domain"). Evaluate both the base and fine-tuned models on it and compare accuracy, relevance, response time, and cost. A successful fine-tune should show a clear, measurable gain on that benchmark; the size of the improvement depends heavily on the task and the quality of your data.
Related resources
- Firecrawl alternatives — managed API comparison if you're building corpora at scale
- Jina Reader alternatives — markdown-extraction tradeoffs for training-clean text
- LangChain integration — load documents and route them into vector stores
- LlamaIndex integration — ingestion pipeline patterns for dataset construction
- RAG pipelines — retrieval-augmented generation, the most common consumer of this dataset
- Content aggregation — broader pattern for high-volume corpus building
- Deep research — agentic research workflows that lean on the same corpora