Web Scraping for LLM Training Data
Use fastCRW to crawl domains into markdown, deduplicate, filter quality, and output JSONL for fine-tuning and RAG datasets.
fastCRW excels at turning web content into fine-tuning and RAG datasets. Crawl a domain into clean markdown, deduplicate pages, filter for quality and relevance, and output JSONL, the standard format for HuggingFace, OpenAI fine-tuning, and LLM training pipelines. The result is a dataset pipeline that deduplicates pages, strips boilerplate and spam, and structures content for effective fine-tuning. The bottleneck shifts from crawling to quality filtering, and fastCRW handles the crawling at scale.
Verdict
fastCRW is the fastest path from web content to LLM training data. Crawl a domain into clean markdown, deduplicate and filter the pages, then output JSONL ready for OpenAI, HuggingFace, or Anthropic workflows. You get a repeatable dataset pipeline that scales from 100 to 1M+ pages. The hard part isn't scraping, it's quality filtering: fastCRW handles the scraping so you can focus on filtering, training, and evaluation. Fine-tuning on scraped domain data can produce models that clearly outperform base models on specialized topics.
Why LLM Training Needs Web Scraping
Base LLMs (GPT-4, Claude, Llama) are generalists trained on broad internet data. They often perform poorly on specialized domains:
- Technical docs: Generic LLMs hallucinate API signatures and parameter names.
- Legal writing: Domain jargon and precedent matter; base models miss nuances.
- Medical information: Base models are too generic, too verbose, and prone to missing critical details.
- Internal knowledge: Your company's processes, codebase, policies—not in training data.
Fine-tuning on domain-specific data teaches models to:
- adopt your writing style and terminology,
- follow your company's processes,
- prioritize accuracy on domain tasks,
- and reduce hallucinations in narrow domains.
Web scraping lets you build fine-tuning datasets at scale from public sources (documentation, tutorials, open-source code) or your own content (docs, blogs, internal wikis).
Where fastCRW Fits
| Stage | What it involves |
|---|---|
| Data collection | crawl entire domain into markdown |
| Cleaning | Markdown output removes HTML boilerplate automatically |
| Deduplication | Remove exact + near-duplicate pages |
| Filtering | Quality heuristics (length, keyword relevance, noise) |
| Formatting | Output JSONL for fine-tuning APIs |
fastCRW handles the first two stages; your code handles deduplication, filtering, and formatting.
Architecture Overview
A typical LLM training dataset pipeline has six stages:
- Crawl: Fetch all pages from target domain(s) as markdown.
- Deduplicate: Remove exact duplicates (MD5 hash) and near-duplicates (fuzzy text similarity).
- Filter: Remove short, noisy, or irrelevant pages using heuristics.
- Segment: Split long documents into chunks suitable for training.
- Structure: Format as JSONL with prompt/completion or instruction/input/output fields.
- Upload: Load into fine-tuning API or training framework.
fastCRW handles the crawl and markdown cleanup; your code handles deduplication through upload.
Implementation Walkthrough
Here's a complete Python example that crawls a domain, deduplicates pages, filters for quality, and outputs JSONL ready for OpenAI fine-tuning.
Step 1: Install dependencies
uv venv
uv pip install requests python-dotenv
Step 2: Crawl domain and deduplicate
import hashlib
import json
import os
import re
from datetime import datetime
from difflib import SequenceMatcher

import requests
from dotenv import load_dotenv

# Load API key from .env (python-dotenv is installed in Step 1)
load_dotenv()
FASTCRW_API_KEY = os.getenv("FASTCRW_API_KEY")
FASTCRW_BASE_URL = "https://api.fastcrw.com/v1"
def crawl_domain(domain: str, max_depth: int = 3, max_pages: int = 1000) -> list[dict]:
"""
Crawl an entire domain and return pages as markdown.
Args:
domain: Domain to crawl (e.g., https://docs.example.com)
max_depth: Max crawl depth (3 = ~1000 pages for medium sites)
max_pages: Max pages to crawl (safeguard against infinite crawls)
Returns:
List of crawled pages with markdown content
"""
print(f"Crawling domain: {domain}")
crawl_payload = {
"url": domain,
"maxDepth": max_depth,
"maxPages": max_pages,
"formats": ["markdown"], # Clean markdown, no HTML
}
response = requests.post(
f"{FASTCRW_BASE_URL}/crawl",
headers={"Authorization": f"Bearer {FASTCRW_API_KEY}"},
json=crawl_payload,
timeout=120
)
response.raise_for_status()
crawl_result = response.json()
# Extract pages from crawl result
pages = []
if "data" in crawl_result:
for item in crawl_result["data"]:
page = {
"url": item.get("url"),
"content": item.get("markdown", ""),
"title": item.get("title", ""),
"crawled_at": datetime.utcnow().isoformat()
}
pages.append(page)
return pages
def compute_content_hash(content: str) -> str:
"""
Compute MD5 hash of content for deduplication.
Args:
content: Text content
Returns:
MD5 hash hex string
"""
return hashlib.md5(content.encode()).hexdigest()
def similarity(a: str, b: str) -> float:
"""
Compute similarity between two strings (0-1).
Args:
a: First string
b: Second string
Returns:
Similarity score (1 = identical, 0 = different)
"""
return SequenceMatcher(None, a, b).ratio()
def deduplicate_pages(pages: list[dict], threshold: float = 0.90) -> list[dict]:
"""
Remove duplicate and near-duplicate pages.
Uses exact match (MD5) first, then fuzzy match (difflib SequenceMatcher).
Keeps first occurrence, removes later duplicates.
Args:
pages: List of page records
threshold: Similarity threshold for near-duplicates (default 0.90)
Returns:
Deduplicated list of pages
"""
seen_hashes = set()
seen_contents = []
unique_pages = []
for page in pages:
content = page.get("content", "").strip()
if not content:
continue
# Check for exact duplicate (MD5 hash)
content_hash = compute_content_hash(content)
if content_hash in seen_hashes:
continue
# Check for near-duplicate (90%+ similarity)
is_duplicate = False
for seen_content in seen_contents:
sim = similarity(content[:1000], seen_content[:1000]) # Compare first 1000 chars for speed
if sim >= threshold:
is_duplicate = True
break
if not is_duplicate:
seen_hashes.add(content_hash)
seen_contents.append(content)
unique_pages.append(page)
return unique_pages
def filter_quality(pages: list[dict]) -> list[dict]:
"""
Filter pages by quality heuristics.
Removes:
- Pages <500 tokens (too short)
- Pages >10K tokens (likely noise/listing pages)
- Pages with <30% unique words (boilerplate-heavy)
- Pages with low content density (<1 sentence per 50 words)
Args:
pages: List of page records
Returns:
Filtered list of quality pages
"""
filtered = []
for page in pages:
content = page.get("content", "").strip()
# Minimum length: ~500 tokens (~2,000 chars at ~4 chars/token)
if len(content) < 2000:
continue
# Maximum length: ~10K tokens (~40,000 chars)
if len(content) > 40000:
continue
# Calculate token estimate (rough: 1 token ≈ 4 chars)
token_count = len(content) / 4
# Uniqueness: count unique words
words = re.findall(r'\b\w+\b', content.lower())
unique_words = len(set(words))
uniqueness = unique_words / len(words) if words else 0
# Too much boilerplate (navigation, footers): <30% unique
if uniqueness < 0.30:
continue
# Sentence count (estimate: sentences end with . ! ?)
sentences = len(re.findall(r'[.!?]+', content))
# Too sparse (navigation heavy): fewer than 1 sentence per 50 words
if sentences > 0 and len(words) / sentences > 50:
continue
# Passed all filters
page["token_count"] = int(token_count)
page["uniqueness"] = round(uniqueness, 2)
filtered.append(page)
return filtered
def chunk_long_content(pages: list[dict], max_tokens: int = 2000) -> list[dict]:
"""
Split long pages into chunks for training.
Preserves semantic breaks (paragraphs) when possible.
Args:
pages: List of page records
max_tokens: Max tokens per chunk (~4 chars per token)
Returns:
Chunked pages
"""
chunked = []
max_chars = max_tokens * 4
for page in pages:
content = page.get("content", "")
url = page.get("url", "")
# If content is short enough, keep as-is
if len(content) <= max_chars:
chunked.append(page)
continue
# Split by paragraphs (double newlines)
paragraphs = content.split("\n\n")
current_chunk = ""
chunk_num = 1
for para in paragraphs:
# If adding this paragraph exceeds max, save current chunk and start new
if len(current_chunk) + len(para) > max_chars and current_chunk:
chunk_page = {
"url": f"{url}#chunk_{chunk_num}",
"content": current_chunk.strip(),
"title": f"{page.get('title', '')} (Part {chunk_num})",
"crawled_at": page.get("crawled_at"),
"token_count": int(len(current_chunk) / 4)
}
chunked.append(chunk_page)
current_chunk = ""
chunk_num += 1
current_chunk += para + "\n\n"
# Add remaining chunk
if current_chunk:
chunk_page = {
"url": f"{url}#chunk_{chunk_num}",
"content": current_chunk.strip(),
"title": f"{page.get('title', '')} (Part {chunk_num})" if chunk_num > 1 else page.get('title', ''),
"crawled_at": page.get("crawled_at"),
"token_count": int(len(current_chunk) / 4)
}
chunked.append(chunk_page)
return chunked
def format_as_jsonl(pages: list[dict]) -> str:
"""
Format pages as prompt/completion JSONL (the legacy OpenAI completions format, also convenient for HuggingFace text datasets).
Uses {"prompt": "...", "completion": "..."} format.
For documentation/knowledge base content, the prompt is title+URL and the completion is the page content.
Args:
pages: List of page records
Returns:
JSONL string (one JSON object per line)
"""
jsonl_lines = []
for page in pages:
# Create a training example
prompt = f"Title: {page.get('title', 'Untitled')}\nURL: {page.get('url', '')}\n\nContent:"
completion = f"\n{page.get('content', '')}"
example = {
"prompt": prompt,
"completion": completion
}
jsonl_lines.append(json.dumps(example))
return "\n".join(jsonl_lines)
def format_as_instruction_jsonl(pages: list[dict]) -> str:
"""
Format pages as instruction-following JSONL.
Uses {"messages": [{"role": "user", "content": "..."}, ...]} format
for Anthropic/OpenAI chat fine-tuning.
Args:
pages: List of page records
Returns:
JSONL string
"""
jsonl_lines = []
for page in pages:
# Create an instruction-following example
example = {
"messages": [
{
"role": "user",
"content": f"Explain: {page.get('title', 'topic')}"
},
{
"role": "assistant",
"content": page.get('content', '')
}
]
}
jsonl_lines.append(json.dumps(example))
return "\n".join(jsonl_lines)
# Example: Main training data pipeline
if __name__ == "__main__":
print("LLM Training Data Scraping Pipeline")
print("-" * 50)
# Step 1: Crawl domain
domain = "https://docs.fastcrw.com" # Example: crawl fastCRW docs
pages = crawl_domain(domain, max_depth=3, max_pages=500)
print(f"\nCrawled {len(pages)} pages")
# Step 2: Deduplicate
unique_pages = deduplicate_pages(pages, threshold=0.90)
print(f"After dedup: {len(unique_pages)} unique pages")
# Step 3: Filter for quality
quality_pages = filter_quality(unique_pages)
print(f"After quality filter: {len(quality_pages)} pages")
# Show statistics
tokens = [p.get("token_count", 0) for p in quality_pages]
total_tokens = sum(tokens)
print(f"Total tokens in dataset: {total_tokens:,} (~{int(total_tokens / 1000)}K)")
# Step 4: Chunk long content
chunked_pages = chunk_long_content(quality_pages, max_tokens=2000)
print(f"After chunking: {len(chunked_pages)} chunks")
# Step 5: Format as JSONL
jsonl_output = format_as_jsonl(chunked_pages)
# Save to file
with open("training_data.jsonl", "w") as f:
f.write(jsonl_output)
print(f"\nSaved to training_data.jsonl")
# Example: Alternative instruction format for chat fine-tuning
instruction_jsonl = format_as_instruction_jsonl(chunked_pages)
with open("training_data_instruction.jsonl", "w") as f:
f.write(instruction_jsonl)
print(f"Also saved instruction format to training_data_instruction.jsonl")
# Show first example
if chunked_pages:
print("\nExample training data point:")
example = json.loads(jsonl_output.split("\n")[0])
print(f"Prompt length: {len(example['prompt'])} chars")
print(f"Completion length: {len(example['completion'])} chars")
Step 3: Run the pipeline
export FASTCRW_API_KEY="your_api_key_here"
uv run python scrape_training_data.py
This creates training_data.jsonl (prompt/completion format) and training_data_instruction.jsonl (chat messages format), ready to upload.
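Before uploading, it's worth a quick sanity check that every line parses and carries the expected fields. A minimal validator sketch (the key names match the formats produced above; the helper itself is illustrative, not part of fastCRW):

import json

def validate_jsonl(path: str, required_keys: set[str]) -> int:
    """Check that every line parses as JSON and contains the expected keys."""
    count = 0
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # raises ValueError if a line is malformed
            missing = required_keys - record.keys()
            if missing:
                raise ValueError(f"Line {line_no} is missing keys: {missing}")
            count += 1
    return count

# Example: print(validate_jsonl("training_data.jsonl", {"prompt", "completion"}))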
Step 4: Upload to OpenAI fine-tuning
Current OpenAI fine-tuning for chat models (gpt-3.5-turbo and newer) expects the chat messages format, so upload training_data_instruction.jsonl; the legacy fine_tunes.create CLI command and prompt/completion format applied only to older completion models. A minimal sketch with the official openai Python SDK (v1.x):
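from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the chat-format dataset produced by the pipeline above
training_file = client.files.create(
    file=open("training_data_instruction.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on the uploaded file
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)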
Or use HuggingFace Datasets:
import json
from datasets import Dataset

with open("training_data.jsonl") as f:
    data = [json.loads(line) for line in f]

dataset = Dataset.from_dict({
    "text": [item["prompt"] + item["completion"] for item in data]
})
dataset.push_to_hub("your-username/domain-dataset")
Production Considerations
Deduplication at Scale
For very large crawls (100K+ pages), exact hash matching is fast, but pairwise fuzzy matching is O(n²). Use approximate matching:
- MinHash: Fast approximate deduplication for very large corpora.
- Bloom filters: Space-efficient set membership for hashes.
- Locality-sensitive hashing: Groups similar content without comparing all pairs.
For datasets under 100K pages, the Python code above is sufficient.
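For example, the datasketch library (a third-party package, used here purely as an illustration) implements MinHash and LSH; a minimal sketch of approximate deduplication under those assumptions:

from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from 3-word shingles."""
    sig = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(len(words) - 2, 1)):
        shingle = " ".join(words[i:i + 3])
        sig.update(shingle.encode("utf8"))
    return sig

def deduplicate_minhash(pages: list[dict], threshold: float = 0.90) -> list[dict]:
    """Approximate near-duplicate removal without comparing every pair of pages."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_pages = []
    for i, page in enumerate(pages):
        sig = minhash_signature(page.get("content", ""))
        if lsh.query(sig):  # an already-kept page is similar enough
            continue
        lsh.insert(f"page_{i}", sig)
        unique_pages.append(page)
    return unique_pages

Because MinHashLSH indexes each signature once, lookup cost stays roughly constant per page instead of growing with the corpus.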
Quality Filtering
The heuristics above (length, uniqueness, sparsity) are good starting points, but domain-specific filtering is better:
- Technical docs: Favor pages with code examples.
- Legal docs: Favor pages with specific section headers.
- Blog content: Favor pages with dates, author info, detailed explanations.
Sample your filtered dataset manually to calibrate filters.
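For instance, a technical-docs filter might keep only pages that contain at least one fenced code block; a rough sketch (the threshold and helper name are illustrative):

import re

def has_code_examples(page: dict, min_blocks: int = 1) -> bool:
    """Keep technical-docs pages that include at least one fenced code block."""
    content = page.get("content", "")
    fenced_blocks = re.findall(r"```[\s\S]*?```", content)
    return len(fenced_blocks) >= min_blocks

# Example: code_pages = [p for p in quality_pages if has_code_examples(p)]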
Chunking Strategy
Long documents (>2K tokens) need splitting. Break at semantic boundaries when possible:
- Documentation: Split by section headers (## Level 2 headings)
- Articles: Split by paragraphs or sentences
- Code: Split by function definitions
- Books: Split by chapters
fastCRW's markdown output preserves structure, making semantic chunking easier.
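A minimal sketch of heading-based chunking for markdown output, as an alternative to the paragraph-based chunk_long_content above (splitting on level-2 headings is an assumption that fits typical documentation):

import re

def chunk_by_headings(page: dict) -> list[dict]:
    """Split a markdown page at level-2 headings, keeping each heading with its section."""
    content = page.get("content", "")
    sections = re.split(r"\n(?=## )", content)
    chunks = []
    for i, section in enumerate(sections, start=1):
        section = section.strip()
        if not section:
            continue
        chunks.append({
            "url": f"{page.get('url', '')}#section_{i}",
            "title": page.get("title", ""),
            "content": section,
            "token_count": int(len(section) / 4),  # same ~4 chars/token estimate as above
        })
    return chunks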
Fine-tuning Evaluation
Always measure improvement:
- Create a held-out test set (10-20% of data).
- Fine-tune a model on the training set.
- Evaluate on domain-specific tasks: accuracy, relevance, style match.
- Compare to baseline (non-fine-tuned model).
- A/B test with real users if possible.
Small high-quality datasets (100–1,000 examples) often beat large mediocre ones (100K examples).
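A minimal sketch of the held-out split (the 15% fraction and fixed seed are illustrative choices):

import random

def train_test_split(examples: list[dict], test_fraction: float = 0.15, seed: int = 42):
    """Hold out a fraction of examples for evaluation before fine-tuning."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Example: train_pages, test_pages = train_test_split(chunked_pages)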
Legal and Ethical Notes
Copyright and Licensing
Only scrape content you own or have permission to use:
- Your own content: Blogs, docs, wikis—OK to scrape.
- Open-source projects: Documentation with open licenses (MIT, Apache)—OK to scrape.
- Creative Commons content: Only if your use respects the license (e.g., CC-BY requires attribution).
- Paywalled content: Books, news behind paywalls—not OK.
- Copyrighted material: Articles, tutorials without explicit permission—risky.
Check the license before scraping. When in doubt, contact the author or use an official API.
Ethical Training Data
High-quality training data should:
- Represent diverse perspectives: Avoid narrow sources that skew your model.
- Exclude toxic content: Remove hate speech, spam, misinformation.
- Respect privacy: Don't scrape personal data or private communications.
- Attribute sources: Document where your training data comes from.
Training on biased data produces biased models. Audit your dataset.
FAQ
Q: What's the minimum dataset size for meaningful fine-tuning?
A: Depends on model size and task. For small models (Llama 2 7B), 100 high-quality examples can improve performance. For large models (GPT-4), 50–100 examples are often sufficient. Quality matters far more than quantity. Experiment with your domain.
Q: How do I deduplicate pages that use templates but have different data?
A: Templated pages (product listings, search results) have identical structure but different content. Use content-based deduplication (Levenshtein distance) rather than structure. fastCRW outputs markdown, which makes this easier—you're comparing text, not HTML.
Q: Can I fine-tune on multiple domains?
A: Yes. Combine JSONL files from different domains. But be aware: multi-domain fine-tuning may cause the model to lose specificity. Train separate models per domain for best results, or use RAG (Retrieval-Augmented Generation) with separate knowledge bases.
Q: What about updating my training data over time?
A: Re-crawl regularly (monthly or quarterly) and retrain the model. Or use incremental fine-tuning (start from previous fine-tuned model, not base model). This keeps your model current without expensive retraining from scratch.
Q: How do I handle pages that are mostly code?
A: Code-heavy pages are valuable for code generation and technical Q&A. Keep them. Use keyword filtering to remove documentation spam (autogenerated API docs with no explanation), but preserve actual tutorials and code examples.
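One way to approximate that, using a prose-to-code ratio heuristic instead of keyword lists, is sketched below (the 20% threshold and helper name are arbitrary):

import re

def is_autogenerated_reference(page: dict, min_prose_ratio: float = 0.20) -> bool:
    """Flag pages that are almost entirely code or signatures with little explanatory prose."""
    content = page.get("content", "")
    if not content:
        return True
    code_chars = sum(len(block) for block in re.findall(r"```[\s\S]*?```", content))
    prose_ratio = (len(content) - code_chars) / len(content)
    return prose_ratio < min_prose_ratio

# Example: tutorial_pages = [p for p in quality_pages if not is_autogenerated_reference(p)]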
Q: Should I filter for English-only content?
A: Depends on your use case. If you're fine-tuning a model for English only, filtering for language is good. Use a language detection library (langdetect, FastText) to identify non-English pages and exclude them.
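A minimal sketch using langdetect (one of the libraries mentioned above; the 2,000-character sample size is an arbitrary choice):

from langdetect import detect

def is_english(page: dict) -> bool:
    """Return True if the page content is detected as English."""
    sample = page.get("content", "")[:2000]  # a sample is enough for detection
    try:
        return detect(sample) == "en"
    except Exception:
        return False  # langdetect raises on very short or ambiguous text

# Example: english_pages = [p for p in quality_pages if is_english(p)]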
Q: How do I benchmark my fine-tuned model?
A: Create a domain-specific benchmark task (e.g., "answer 50 common questions in your domain"). Evaluate both the base and fine-tuned models on it and compare accuracy, relevance, response time, and cost. A successful fine-tune should show a clear, measurable gain on that benchmark; the size of the improvement depends heavily on the task and the quality of your data.
Related resources
- Firecrawl alternatives — managed API comparison if you're building corpora at scale
- Jina Reader alternatives — markdown-extraction tradeoffs for training-clean text
- LangChain integration — load documents and route them into vector stores
- LlamaIndex integration — ingestion pipeline patterns for dataset construction
- RAG pipelines — retrieval-augmented generation, the most common consumer of this dataset
- Content aggregation — broader pattern for high-volume corpus building
- Deep research — agentic research workflows that lean on the same corpora