
Web Dataset Curation for ML Training

Assemble training-ready JSONL datasets from the open web with fastCRW — /v1/map to enumerate URLs, /v1/scrape to fetch them as clean markdown, then deduplicate and serialise for HuggingFace, OpenAI fine-tuning, or your own loader.

Published: May 27, 2026. Updated: May 27, 2026.

Who this is for

Researchers and ML engineers building training datasets from the open web — fine-tuning corpora for domain models, evaluation sets for retrieval benchmarks, instruction data for small task-specific models. The work is not the model; it is shipping a clean, deduplicated, reproducible JSONL file that you can hand to a training loop and never apologise for.

fastCRW is the front of that pipeline. Map enumerates URLs, scrape fetches them as markdown, and the rest is your filtering and serialisation logic.

Why fastCRW for dataset curation

Three properties matter for dataset work: discovery is separate from extraction, the markdown output is compact and reproducible, and the license lets you run the crawl at zero per-page cost.

POST /v1/map (docs.fastcrw.com/api-reference/map/) returns every URL reachable from a seed, optionally filtered by a search substring and bounded by includeSubdomains. This is the cheap discovery pass — it lets you see the size and shape of the corpus before you commit scrape credits.
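
One way to see the "size and shape" of a corpus before scraping is to bucket the mapped URLs by host and first path segment. The sketch below assumes you already hold the URL list as plain strings; `summarise_urls` and the sample URLs are hypothetical and not part of fastCRW.

```python
from collections import Counter
from urllib.parse import urlparse

def summarise_urls(urls: list[str]) -> Counter:
    """Count URLs per host plus first path segment, e.g. 'docs.example.com/api'."""
    counts: Counter = Counter()
    for url in urls:
        parts = urlparse(url)
        segments = [s for s in parts.path.split("/") if s]
        first = segments[0] if segments else ""
        counts[f"{parts.netloc}/{first}"] += 1
    return counts

# Hypothetical URL list standing in for the links returned by a map call.
sample = [
    "https://docs.example.com/api/map",
    "https://docs.example.com/api/scrape",
    "https://docs.example.com/guides/quickstart",
    "https://docs.example.com/",
]

for bucket, n in summarise_urls(sample).most_common():
    print(f"{n:>5}  {bucket}")
```

Buckets that dominate the count are the ones worth narrowing with the search filter before any scrape credits are spent.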

POST /v1/scrape (docs.fastcrw.com/api-reference/scrape/) then fetches each URL and returns clean markdown. Markdown matters here because it is denser per token than HTML, the structure (headings, lists, code blocks) survives the trip, and the resulting dataset is easier to diff across re-crawls when you need to audit drift.
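
To illustrate the drift-audit point, here is a minimal sketch that diffs two markdown snapshots of the same URL with the standard library's `difflib`. The function name `markdown_drift` and the sample text are assumptions made for illustration; how you store snapshots between crawls is up to you.

```python
import difflib

def markdown_drift(old_md: str, new_md: str, label: str = "page") -> list[str]:
    """Return unified-diff lines between two markdown snapshots of one URL."""
    return list(
        difflib.unified_diff(
            old_md.splitlines(),
            new_md.splitlines(),
            fromfile=f"{label}@previous",
            tofile=f"{label}@current",
            lineterm="",
        )
    )

# Hypothetical snapshots from two crawls of the same page.
old = "# Setup\n\n1. Download the binary\n2. Start the server"
new = "# Setup\n\n1. Download the binary\n2. Start the server and open the dashboard"

for line in markdown_drift(old, new, "example.com/setup"):
    print(line)
```

An empty result means the page did not change between crawls, which makes it cheap to flag only the rows that need re-review.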

fastCRW ships as a single static Rust binary under AGPL-3.0. Self-hosters pay $0 per 1,000 scrapes — only the server bill — which is the right cost shape when your dataset needs are measured in millions of pages.

The 5-step recipe

1. Enumerate candidate URLs with /v1/map. POST /v1/map with the seed domain to discover every reachable URL. Use the search and includeSubdomains options to narrow the surface before you spend any scrape credits.
2. Fetch each URL as markdown with /v1/scrape. Iterate the URL list through POST /v1/scrape with formats ["markdown"]. Run with bounded concurrency (32-64 workers) so you neither overload the source site nor exceed your own rate limits.
3. Deduplicate on content hash and near-duplicate similarity. Hash each page with MD5 for exact dedup, then run a SimHash or MinHash pass for near-duplicates above ~90% similarity. The open web is full of mirrors; skipping this step poisons fine-tuning.
4. Filter for length, language, and quality. Drop pages under ~500 tokens (too short to train on) and over ~10,000 tokens (mostly aggregation noise). Run a language ID pass if you care about a single locale. Keep a quality heuristic, such as unique-word ratio or boilerplate density, that you can re-run later.
5. Serialise to JSONL with provenance metadata. Emit one JSON object per line (text, source URL, content hash, fetched_at, license) so the dataset is reproducible and auditable. JSONL plays cleanly with HuggingFace datasets, OpenAI fine-tuning, and any streaming loader.
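
The length and quality filters from the filtering step could be sketched as follows. The 500 and 10,000 bounds come from the recipe; everything else is an assumption: whitespace-separated words stand in for tokens, the repeated-line ratio is a crude proxy for boilerplate density, and the 0.2 and 0.3 thresholds are illustrative defaults to tune on your own data. Language ID is left out because it needs a third-party model.

```python
def quality_stats(text: str) -> dict:
    """Cheap, re-runnable statistics for one page of markdown."""
    words = text.lower().split()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {
        "n_words": len(words),
        # Share of distinct words; very low values suggest repetitive spam.
        "unique_ratio": len(set(words)) / len(words) if words else 0.0,
        # Share of lines that repeat an earlier line (boilerplate proxy).
        "dup_line_ratio": 1 - len(set(lines)) / len(lines) if lines else 0.0,
    }

def keep(
    text: str,
    min_words: int = 500,
    max_words: int = 10_000,
    min_unique_ratio: float = 0.2,   # assumed threshold
    max_dup_line_ratio: float = 0.3, # assumed threshold
) -> bool:
    """Return True if the page passes the length and quality gates."""
    stats = quality_stats(text)
    return (
        min_words <= stats["n_words"] <= max_words
        and stats["unique_ratio"] >= min_unique_ratio
        and stats["dup_line_ratio"] <= max_dup_line_ratio
    )
```

Storing the output of `quality_stats` alongside each row lets you change thresholds later without re-crawling.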

# curate_dataset.py — run with: python3 curate_dataset.py
import os
import json
import hashlib
import datetime as dt
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}

def discover(seed: str, needle: str | None = None) -> list[str]:
    payload: dict = {"url": seed, "includeSubdomains": False}
    if needle:
        payload["search"] = needle
    r = requests.post(f"{CRW}/map", json=payload, headers=HEADERS, timeout=60)
    r.raise_for_status()
    return r.json()["data"]["links"]

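def scrape(url: str) -> dict | None:
    try:
        r = requests.post(
            f"{CRW}/scrape",
            json={"url": url, "formats": ["markdown"]},
            headers=HEADERS, timeout=60,
        )
    except requests.RequestException:
        # A timeout or connection error on one URL should not abort the whole run.
        return None
    if not r.ok:
        return None
    md = r.json().get("data", {}).get("markdown")
    # Whitespace-separated words approximate the ~500-token floor.
    if not md or len(md.split()) < 500:
        return None
    return {
        "url": url,
        "text": md,
        "content_hash": hashlib.md5(md.encode()).hexdigest(),
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp.
        "fetched_at": dt.datetime.now(dt.timezone.utc).isoformat().replace("+00:00", "Z"),
    }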

def curate(seed: str, out_path: str, needle: str | None = None) -> None:
    urls = discover(seed, needle)
    seen_hashes: set[str] = set()
    # Explicit UTF-8 because ensure_ascii=False writes non-ASCII characters as-is.
    with open(out_path, "w", encoding="utf-8") as fh, ThreadPoolExecutor(max_workers=32) as pool:
        futures = [pool.submit(scrape, u) for u in urls]
        for fut in as_completed(futures):
            row = fut.result()
            if not row or row["content_hash"] in seen_hashes:
                continue
            seen_hashes.add(row["content_hash"])
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    curate("https://docs.fastcrw.com", "fastcrw_docs.jsonl")
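
The script above removes exact duplicates only. A near-duplicate pass in the spirit of the recipe's MinHash suggestion might look like the sketch below. It is a standard-library approximation, not a fastCRW feature: the 5-word shingles, 64 hash functions, and pairwise comparison are assumptions, and at large scale you would likely replace the O(n^2) loop with LSH banding or a dedicated library.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    """Lower-cased k-word shingles; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text: str, num_perm: int = 64) -> list[int]:
    """MinHash signature: for each salted hash, keep the minimum over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            (int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
             for g in grams),
            default=0,
        ))
    return signature

def similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def near_dedup(rows: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep the first row of each near-duplicate cluster (pairwise comparison)."""
    kept, signatures = [], []
    for row in rows:
        sig = minhash(row["text"])
        if any(similarity(sig, s) >= threshold for s in signatures):
            continue
        kept.append(row)
        signatures.append(sig)
    return kept
```

Run `near_dedup` over the rows that survived the exact-hash check, so the more expensive signature step only sees unique content.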

Next steps

The /v1/map and /v1/scrape references live at docs.fastcrw.com; managed-cloud pricing for teams that prefer not to run the binary is on fastcrw.com/pricing. For dataset work at the million-page scale, self-host the binary and partition the seed list by domain so you can scale workers horizontally on commodity hardware.
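
Partitioning by domain can be as simple as hashing each URL's host to a shard number, so every worker owns a stable set of domains and per-site rate limiting stays local to one worker. The helper names and the shard count below are hypothetical; this is a sketch of the idea rather than a prescribed deployment layout.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse

def shard_for(domain: str, num_shards: int) -> int:
    """Map a domain to a shard with a stable hash (unlike the built-in hash())."""
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def partition_by_domain(urls: list[str], num_shards: int) -> dict[int, list[str]]:
    """Group URLs so that all pages of one domain land on the same shard."""
    shards: dict[int, list[str]] = defaultdict(list)
    for url in urls:
        domain = urlparse(url).netloc.lower()
        shards[shard_for(domain, num_shards)].append(url)
    return dict(shards)

# Hypothetical seed list; each shard's URL list would go to one worker.
seeds = [
    "https://a.example.com/1",
    "https://a.example.com/2",
    "https://b.example.org/x",
    "https://c.example.net/y",
]
for shard_id, shard_urls in sorted(partition_by_domain(seeds, 4).items()):
    print(shard_id, shard_urls)
```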

Sources

fastCRW /v1/map reference: https://docs.fastcrw.com/api-reference/map/
fastCRW /v1/scrape reference: https://docs.fastcrw.com/api-reference/scrape/
HuggingFace datasets library: https://huggingface.co/docs/datasets

FAQ

Why JSONL instead of CSV or Parquet for training data?

JSONL is line-delimited, append-friendly, and streams without loading the whole file into memory — exactly what HuggingFace datasets, OpenAI fine-tuning, and most custom training loops expect. Parquet is denser for columnar analytics; CSV breaks on nested fields. JSONL is the modern default for text corpora.
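
As a small illustration of the append-and-stream properties, the sketch below appends rows one at a time and reads them back lazily, checking for the provenance fields used earlier on this page. The helper names and the required-field set are assumptions for this example; add `license` to the set if your rows carry it.

```python
import json
from typing import Iterator

# Fields assumed to be present on every row.
REQUIRED = {"url", "text", "content_hash", "fetched_at"}

def append_row(path: str, row: dict) -> None:
    """Append one JSON object as a single line; earlier lines are untouched."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")

def stream_jsonl(path: str) -> Iterator[dict]:
    """Yield rows one at a time so the file never has to fit in memory."""
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            missing = REQUIRED - row.keys()
            if missing:
                raise ValueError(f"{path}:{lineno} missing fields: {sorted(missing)}")
            yield row
```

Because `stream_jsonl` is a generator, a training loop can iterate over it directly without loading the whole file.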

What does the AGPL-3.0 license mean for my training dataset?

The license applies to the fastCRW binary itself, not to your dataset. You can self-host the binary to crawl the open web at $0 per 1,000 scrapes and use the resulting dataset under whatever terms the source pages allow. Respect each site's robots.txt and terms of service — the binary follows robots.txt by default and only overrides when you opt in for legitimate cases.

How big a corpus is reasonable for one fastCRW worker?

A single worker comfortably handles tens of thousands of pages per day on commodity hardware, since the binary idles around 50 MB RAM (the footprint figure reported in `crw-opencore/README.md`). For million-page corpora, partition the seed list by domain and run multiple workers in parallel.
