Skip to main content
Use Cases/Use Case / Dataset Curation

Web Dataset Curation for ML Training

Assemble training-ready JSONL datasets from the open web with fastCRW — /v1/map to enumerate URLs, /v1/scrape to fetch them as clean markdown, then deduplicate and serialise for HuggingFace, OpenAI fine-tuning, or your own loader.

Published
May 27, 2026
Updated
May 27, 2026
Category
use cases
Map → scrape → JSONL is the whole pipeline; no orchestration layerMarkdown output keeps the dataset compact and reproducibleAGPL-3.0 self-host means zero per-page cost for the crawl side

Who this is for

Researchers and ML engineers building training datasets from the open web — fine-tuning corpora for domain models, evaluation sets for retrieval benchmarks, instruction data for small task-specific models. The work is not the model; it is shipping a clean, deduplicated, reproducible JSONL file that you can hand to a training loop and never apologise for.

fastCRW is the front of that pipeline. Map enumerates URLs, scrape fetches them as markdown, and the rest is your filtering and serialisation logic.

Why fastCRW for dataset curation

Three properties matter for dataset work: discovery is separate from extraction, the markdown output is compact and reproducible, and the license lets you run the crawl at zero per-page cost.

POST /v1/map (docs.fastcrw.com/api-reference/map/) returns every URL reachable from a seed, optionally filtered by a search substring and bounded by includeSubdomains. This is the cheap discovery pass — it lets you see the size and shape of the corpus before you commit scrape credits.

POST /v1/scrape (docs.fastcrw.com/api-reference/scrape/) then fetches each URL and returns clean markdown. Markdown matters here because it is denser per token than HTML, the structure (headings, lists, code blocks) survives the trip, and the resulting dataset is easier to diff across re-crawls when you need to audit drift.

fastCRW ships as a single static Rust binary under AGPL-3.0 (per marketing/CANONICAL-FACTS.md §1). Self-hosters pay $0 per 1,000 scrapes — only the server bill — which is the right cost shape when your dataset needs are measured in millions of pages.

The 5-step recipe

  1. Enumerate candidate URLs with /v1/map. POST /v1/map with the seed domain to discover every reachable URL. Use the search and includeSubdomains options to narrow the surface before you spend any scrape credits.
  2. Fetch each URL as markdown with /v1/scrape. Iterate the URL list through POST /v1/scrape with formats ["markdown"]. Run with bounded concurrency (32-64 workers) so you do not melt the source site or your own rate limits.
  3. Deduplicate on content hash and near-duplicate similarity. Hash each page with MD5 for exact dedup, then run a SimHash or MinHash pass for near-duplicates above ~90% similarity. The open web is full of mirrors; skipping this step poisons fine-tuning.
  4. Filter for length, language, and quality. Drop pages under ~500 tokens (too short to train on) and over ~10,000 tokens (mostly aggregation noise). Run a language ID pass if you care about a single locale. Keep a quality heuristic — unique-word ratio, boilerplate density — that you can re-run later.
  5. Serialise to JSONL with provenance metadata. Emit one JSON object per line — text, source URL, content hash, fetched_at, license — so the dataset is reproducible and auditable. JSONL plays cleanly with HuggingFace datasets, OpenAI fine-tuning, and any streaming loader.
# curate_dataset.py — run with: python3 curate_dataset.py
import os
import json
import hashlib
import datetime as dt
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}

def discover(seed: str, needle: str | None = None) -> list[str]:
    payload: dict = {"url": seed, "includeSubdomains": False}
    if needle:
        payload["search"] = needle
    r = requests.post(f"{CRW}/map", json=payload, headers=HEADERS, timeout=60)
    r.raise_for_status()
    return r.json()["data"]["links"]

def scrape(url: str) -> dict | None:
    r = requests.post(
        f"{CRW}/scrape",
        json={"url": url, "formats": ["markdown"]},
        headers=HEADERS, timeout=60,
    )
    if not r.ok:
        return None
    md = r.json()["data"]["markdown"]
    if not md or len(md.split()) < 500:
        return None
    return {
        "url": url,
        "text": md,
        "content_hash": hashlib.md5(md.encode()).hexdigest(),
        "fetched_at": dt.datetime.utcnow().isoformat() + "Z",
    }

def curate(seed: str, out_path: str, needle: str | None = None) -> None:
    urls = discover(seed, needle)
    seen_hashes: set[str] = set()
    with open(out_path, "w") as fh, ThreadPoolExecutor(max_workers=32) as pool:
        futures = [pool.submit(scrape, u) for u in urls]
        for fut in as_completed(futures):
            row = fut.result()
            if not row or row["content_hash"] in seen_hashes:
                continue
            seen_hashes.add(row["content_hash"])
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    curate("https://docs.fastcrw.com", "fastcrw_docs.jsonl")

Next steps

The /v1/map and /v1/scrape references live at docs.fastcrw.com; managed-cloud pricing for teams that prefer not to run the binary is on fastcrw.com/pricing. For dataset work at the million-page scale, self-host the binary and partition the seed list by domain so you can scale workers horizontally on commodity hardware.

Continue exploring

More from Use Cases

View all use cases

Related hubs

Keep the crawl path moving