Web Dataset Curation for ML Training
Assemble training-ready JSONL datasets from the open web with fastCRW — /v1/map to enumerate URLs, /v1/scrape to fetch them as clean markdown, then deduplicate and serialise for HuggingFace, OpenAI fine-tuning, or your own loader.
Who this is for
Researchers and ML engineers building training datasets from the open web — fine-tuning corpora for domain models, evaluation sets for retrieval benchmarks, instruction data for small task-specific models. The work is not the model; it is shipping a clean, deduplicated, reproducible JSONL file that you can hand to a training loop and never apologise for.
fastCRW is the front of that pipeline. Map enumerates URLs, scrape fetches them as markdown, and the rest is your filtering and serialisation logic.
Why fastCRW for dataset curation
Three properties matter for dataset work: discovery is separate from extraction, the markdown output is compact and reproducible, and the license lets you run the crawl at zero per-page cost.
POST /v1/map
(docs.fastcrw.com/api-reference/map/)
returns every URL reachable from a seed, optionally filtered by a search
substring and bounded by includeSubdomains. This is the cheap discovery
pass — it lets you see the size and shape of the corpus before you commit
scrape credits.
POST /v1/scrape
(docs.fastcrw.com/api-reference/scrape/)
then fetches each URL and returns clean markdown. Markdown matters here
because it is denser per token than HTML, the structure (headings, lists,
code blocks) survives the trip, and the resulting dataset is easier to
diff across re-crawls when you need to audit drift.
fastCRW ships as a single static Rust binary under AGPL-3.0 (per
marketing/CANONICAL-FACTS.md §1). Self-hosters pay $0 per 1,000 scrapes
— only the server bill — which is the right cost shape when your dataset
needs are measured in millions of pages.
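The re-crawl diffing mentioned above can be sketched with the standard library alone. This is a minimal illustration, not part of fastCRW: the function name `markdown_drift` and the idea of keeping two markdown snapshots per URL are assumptions made for the example.

```python
import difflib


def markdown_drift(old_md: str, new_md: str, label: str = "page") -> tuple[float, str]:
    """Compare two markdown snapshots of the same URL.

    Returns a line-level similarity ratio in [0, 1] and a unified diff.
    The diff is an empty string when the snapshots are identical.
    """
    old_lines = old_md.splitlines()
    new_lines = new_md.splitlines()
    ratio = difflib.SequenceMatcher(None, old_lines, new_lines).ratio()
    diff = "\n".join(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"{label}@previous",
            tofile=f"{label}@current",
            lineterm="",
        )
    )
    return ratio, diff
```

A low ratio on a page that should be stable is a cue to re-check that row before it goes back into the dataset.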
The 5-step recipe
- Enumerate candidate URLs with /v1/map. POST /v1/map with the seed domain to discover every reachable URL. Use the search and includeSubdomains options to narrow the surface before you spend any scrape credits.
- Fetch each URL as markdown with /v1/scrape. Iterate the URL list through POST /v1/scrape with formats ["markdown"]. Run with bounded concurrency (32-64 workers) so you neither overload the source site nor exhaust your own rate limits.
- Deduplicate on content hash and near-duplicate similarity. Hash each page with MD5 for exact dedup, then run a SimHash or MinHash pass for near-duplicates above ~90% similarity. The open web is full of mirrors; skipping this step poisons fine-tuning.
- Filter for length, language, and quality. Drop pages under ~500 tokens (too short to train on) and over ~10,000 tokens (mostly aggregation noise). Run a language ID pass if you care about a single locale. Keep a quality heuristic — unique-word ratio, boilerplate density — that you can re-run later.
- Serialise to JSONL with provenance metadata. Emit one JSON object per line — text, source URL, content hash, fetched_at, license — so the dataset is reproducible and auditable. JSONL plays cleanly with HuggingFace datasets, OpenAI fine-tuning, and any streaming loader.
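The quality heuristic in the filtering step can be sketched as a pair of cheap, re-runnable scores. Everything here is an assumption for illustration: the function names, the use of word counts as a rough stand-in for token counts, and the `min_unique_ratio` and `max_repeated` thresholds, which you would tune on a sample of your own corpus. Language identification is left out because it needs a third-party library.

```python
import re

WORD_RE = re.compile(r"\w+")


def unique_word_ratio(text: str) -> float:
    """Distinct words divided by total words; low values suggest repetitive spam."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    return len(set(words)) / len(words) if words else 0.0


def repeated_line_fraction(text: str) -> float:
    """Share of non-empty lines that repeat an earlier line (a crude boilerplate signal)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def passes_quality(
    text: str,
    min_words: int = 500,
    max_words: int = 10_000,
    min_unique_ratio: float = 0.2,   # illustrative threshold
    max_repeated: float = 0.3,       # illustrative threshold
) -> bool:
    """Apply the length bounds first, then the two heuristic scores."""
    n_words = len(WORD_RE.findall(text))
    if not (min_words <= n_words <= max_words):
        return False
    return (
        unique_word_ratio(text) >= min_unique_ratio
        and repeated_line_fraction(text) <= max_repeated
    )
```

Keeping each score as its own function means the raw values can be stored alongside each row and the thresholds changed later without re-crawling.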
# curate_dataset.py (run with: python3 curate_dataset.py; needs Python 3.10+)
import os
import json
import hashlib
import datetime as dt
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

CRW = "https://api.fastcrw.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['CRW_API_KEY']}"}


def discover(seed: str, needle: str | None = None) -> list[str]:
    """Step 1: enumerate candidate URLs reachable from the seed."""
    payload: dict = {"url": seed, "includeSubdomains": False}
    if needle:
        payload["search"] = needle
    r = requests.post(f"{CRW}/map", json=payload, headers=HEADERS, timeout=60)
    r.raise_for_status()
    return r.json()["data"]["links"]


def scrape(url: str) -> dict | None:
    """Step 2: fetch one URL as markdown; return None for failed or short pages."""
    try:
        r = requests.post(
            f"{CRW}/scrape",
            json={"url": url, "formats": ["markdown"]},
            headers=HEADERS, timeout=60,
        )
    except requests.RequestException:
        # A timeout or connection error on one page should not abort the run.
        return None
    if not r.ok:
        return None
    md = r.json()["data"]["markdown"]
    # Whitespace-split words are a rough proxy for the ~500-token floor.
    if not md or len(md.split()) < 500:
        return None
    fetched_at = dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds")
    return {
        "url": url,
        "text": md,
        "content_hash": hashlib.md5(md.encode("utf-8")).hexdigest(),
        "fetched_at": fetched_at.replace("+00:00", "Z"),
    }


def curate(seed: str, out_path: str, needle: str | None = None) -> None:
    """Steps 3-5: exact-hash dedup and JSONL serialisation."""
    urls = discover(seed, needle)
    seen_hashes: set[str] = set()
    # Explicit UTF-8, because ensure_ascii=False writes non-ASCII text as-is.
    with open(out_path, "w", encoding="utf-8") as fh, \
            ThreadPoolExecutor(max_workers=32) as pool:
        futures = [pool.submit(scrape, u) for u in urls]
        for fut in as_completed(futures):
            row = fut.result()
            if not row or row["content_hash"] in seen_hashes:
                continue
            seen_hashes.add(row["content_hash"])
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    curate("https://docs.fastcrw.com", "fastcrw_docs.jsonl")
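The script above deduplicates only on exact content hash. The near-duplicate pass from step 3 could be sketched with a small pure-Python MinHash, shown below. The function names, the 5-word shingle size, and the 64-hash signature length are illustrative assumptions. The pairwise comparison is quadratic in the number of kept rows, so a corpus of millions of pages would need LSH banding or a dedicated library instead.

```python
import hashlib
import re


def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping k-word windows; the unit MinHash compares."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64, k: int = 5) -> list[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            (
                int.from_bytes(
                    hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for g in grams
            ),
            default=0,  # empty text yields an all-zero signature
        ))
    return signature


def estimated_similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching positions, which estimates Jaccard similarity."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def drop_near_duplicates(rows: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep the first row of each near-duplicate cluster (O(n^2) comparisons)."""
    kept: list[dict] = []
    kept_sigs: list[list[int]] = []
    for row in rows:
        sig = minhash_signature(row["text"])
        if any(estimated_similarity(sig, s) >= threshold for s in kept_sigs):
            continue
        kept.append(row)
        kept_sigs.append(sig)
    return kept
```

Running this over the rows that survive the exact-hash check, before writing the JSONL file, keeps mirrored pages that differ only by a footer or timestamp out of the dataset.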
Next steps
The /v1/map and /v1/scrape references live at
docs.fastcrw.com; managed-cloud pricing for
teams that prefer not to run the binary is on
fastcrw.com/pricing. For dataset work at
the million-page scale, self-host the binary and partition the seed list
by domain so you can scale workers horizontally on commodity hardware.
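One way to partition a URL or seed list by domain is to hash the hostname into a fixed number of shards, so every page from a given host lands on the same worker. This is a sketch under assumptions: `partition_by_domain` and the shard count are hypothetical names and values, not fastCRW features.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_domain(urls: list[str], num_shards: int) -> dict[int, list[str]]:
    """Group URLs into shards keyed by a stable hash of the hostname."""
    shards: dict[int, list[str]] = defaultdict(list)
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        # sha256 is stable across processes, unlike the built-in hash() on
        # strings, which is randomised per interpreter run.
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        shard = int.from_bytes(digest[:8], "big") % num_shards
        shards[shard].append(url)
    return dict(shards)
```

Keeping a host on a single shard also makes per-site politeness limits easier to enforce, since only one worker ever talks to that site.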