Financial Research Agent: Web Scraping for Investing

Build a financial research agent that searches and scrapes filings, IR pages, and news live, then feeds clean data to an LLM for investment analysis.

By Recep · June 28, 2026 · 11 min read · Last updated: June 2, 2026

By the fastCRW team · Benchmarks and capabilities verified 2026-05-18 · fastCRW launch pricing expires 2026-06-01 · Verify independently.

Disclosure: we build fastCRW, so weight this accordingly. We have kept the gaps and the slow-tail latency explicit, because a finance agent that pretends its retrieval layer has no failure modes is a liability, not an advantage.

What a financial research agent does

A financial research agent built on web scraping is a loop, not an endpoint: it searches the live web for the freshest primary sources, scrapes filings, investor-relations (IR) pages, and news into clean text, extracts the specific numbers and facts you care about, and hands that context to an LLM that synthesizes an answer. The reason you build this instead of asking a base model directly is freshness. A model's weights froze at training time; an 8-K filed this morning, a guidance revision on an IR page, or a downgrade in today's news simply does not exist inside the model. The scraping layer is what closes that gap.

Filings, IR pages, and news as live sources

The three source classes behave differently. Regulatory filings (10-K, 10-Q, 8-K and their international equivalents) are dense, structured, and authoritative — they are the ground truth. IR pages carry the company's own framing: earnings decks, press releases, and dividend or buyback announcements, often before they propagate to aggregators. News and analyst commentary is the noisiest layer but the fastest-moving. A useful agent reads all three and weights them by reliability, treating a filing as fact and a headline as a signal to verify.
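The weighting idea above can be sketched in a few lines. This is a minimal illustration, not part of fastCRW: the `Source` type, the `SOURCE_WEIGHTS` values, and both helper functions are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass

# Illustrative reliability weights. The values are assumptions for this
# sketch; tune them to your own policy.
SOURCE_WEIGHTS = {"filing": 1.0, "ir": 0.7, "news": 0.4}


@dataclass(frozen=True)
class Source:
    url: str
    kind: str   # "filing", "ir", or "news"
    claim: str  # the fact or headline taken from the page


def rank_sources(sources):
    """Order sources so the most reliable class comes first."""
    return sorted(
        sources,
        key=lambda s: SOURCE_WEIGHTS.get(s.kind, 0.0),
        reverse=True,
    )


def needs_verification(source):
    """Treat anything that is not a filing as a signal to verify."""
    return source.kind != "filing"
```

Unknown source kinds get weight 0.0 here, so an unclassified page sorts last rather than being trusted by default.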

Why freshness beats a stale model

Investment decisions are time-sensitive in a way that most LLM use cases are not. The value of "what changed in the last 24 hours" decays by the hour, and a confidently wrong answer drawn from stale training data is worse than no answer. The architectural consequence is that your agent's quality is bounded by its retrieval layer: if the scraper silently drops the page that announced the guidance cut, the smartest synthesis model in the world will produce a fluent, wrong thesis. That is why the rest of this post is mostly about the retrieval primitives, not the prompt.

The search-scrape-synthesize loop

The core loop has three primitives, and fastCRW exposes each as a Firecrawl-compatible REST endpoint, so the official Firecrawl SDK works against fastCRW after a base-URL swap.

Live web search for fresh sources

Step one is discovery: given a ticker or company, find today's relevant URLs. POST /v1/search (backed by a SearXNG sidecar) returns ranked results and can optionally scrape their content in the same call. Search costs 1 credit per query. On latency, fastCRW search averaged 880 ms over a 100-query benchmark and took 73 of 100 latency wins against Firecrawl and Tavily (triple-bench.ts, 100 queries) — the discovery hop is rarely your bottleneck. We cover the search primitive in depth in the search API for AI agents.
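A discovery call might look like the sketch below, which uses only the Python standard library. The endpoint path comes from the text; the base URL, the API key, the bearer-token header, the `query` and `limit` body fields, and the `data[].url` response shape are assumptions modeled on Firecrawl's v1 API, so check them against your deployment.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute your own deployment URL and key.
BASE_URL = "https://crw.example.com"
API_KEY = "YOUR_API_KEY"


def build_search_request(query, limit=5, base_url=BASE_URL, api_key=API_KEY):
    """Build (but do not send) a POST /v1/search request."""
    payload = {"query": query, "limit": limit}  # field names are assumed
    return urllib.request.Request(
        url=f"{base_url}/v1/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme is assumed
        },
        method="POST",
    )


def result_urls(response_body):
    """Collect URLs from a body shaped like {"data": [{"url": ...}, ...]}."""
    return [item["url"] for item in response_body.get("data", []) if "url" in item]
```

To send the request, pass it to `urllib.request.urlopen(req, timeout=10)` and decode the body with `json.load`. Keeping the builder separate from the network call makes the payload easy to log and to unit-test.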

Scraping filings and IR pages to clean markdown

Step two turns a URL into LLM-ready text. POST /v1/scrape returns clean markdown, which matters enormously for finance: a 10-K rendered as raw HTML buries the MD&A section in navigation chrome and footnote tables, and feeding that to a model wastes tokens and degrades extraction. On accuracy, fastCRW posted the highest truth-recall of the three tools tested — 63.74% of 819 labeled URLs (diagnose_3way.py, Firecrawl's public dataset, 2026-05-08), versus Crawl4AI 59.95% and Firecrawl 56.04% on the same set — paired with 91.8% scrape-success of reachable URLs and 0 thrown errors across 3,000 requests. fastCRW also recovers 34 URLs that neither Crawl4AI nor Firecrawl reach — 70% more exclusive recoveries than the other two combined. For a research agent, recall is the headline number: it is the probability that the page carrying the fact you need actually makes it into context. A scrape is 1 credit regardless of renderer — http, lightpanda, or chrome all cost the same flat 1 credit.
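The scrape step can be sketched the same way. The `/v1/scrape` path and the markdown output are from the text; the `url` and `formats` body fields, the bearer header, and the `data.markdown` response key are assumptions based on the Firecrawl-compatible shape, and the base URL and key are placeholders.

```python
import json
import urllib.request


def build_scrape_request(url, base_url="https://crw.example.com", api_key="YOUR_API_KEY"):
    """Build (but do not send) a POST /v1/scrape request asking for markdown."""
    payload = {"url": url, "formats": ["markdown"]}  # field names are assumed
    return urllib.request.Request(
        url=f"{base_url}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme is assumed
        },
        method="POST",
    )


def extract_markdown(response_body):
    """Return the markdown string, or None when the scrape produced none."""
    data = response_body.get("data") or {}
    return data.get("markdown")
```

Returning `None` on a miss, instead of an empty string, lets the caller distinguish "page was empty" from "page never arrived", which matters when recall is the number you care about.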

Structured extraction of financial fields

Step three pulls specific fields — reported revenue, EPS, guidance range, filing date, sentiment — into a typed record instead of leaving them in prose. Add formats: ["json"] plus a jsonSchema to a /v1/scrape call and the engine returns structured JSON; this bills at 5 credits. LLM extraction is a managed feature available on paid plans (the FREE plan has no LLM features), so the model that reads your filings is the engine's managed LLM. The mechanics are covered in structured extraction with a JSON schema.
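As an illustration of the extraction step, the sketch below builds a request body and sanity-checks the returned record. The `formats: ["json"]` value and the `jsonSchema` name are from the text; placing `jsonSchema` at the top level of the body is an assumption, and the field names in `FILING_SCHEMA` are hypothetical.

```python
# Hypothetical schema: the field names are examples, not a fastCRW standard.
FILING_SCHEMA = {
    "type": "object",
    "properties": {
        "revenue_usd": {"type": "number"},
        "eps": {"type": "number"},
        "guidance_low_usd": {"type": "number"},
        "guidance_high_usd": {"type": "number"},
        "filing_date": {"type": "string"},
    },
    "required": ["filing_date"],
}


def build_extract_payload(url, schema=FILING_SCHEMA):
    """Body for a /v1/scrape call that asks for structured JSON."""
    # Top-level placement of jsonSchema is an assumption.
    return {"url": url, "formats": ["json"], "jsonSchema": schema}


def validate_record(record, schema=FILING_SCHEMA):
    """Return (missing_required_keys, keys_with_non_numeric_values)."""
    missing = [key for key in schema["required"] if key not in record]
    wrong = [
        key
        for key, spec in schema["properties"].items()
        if key in record
        and spec["type"] == "number"
        and (isinstance(record[key], bool) or not isinstance(record[key], (int, float)))
    ]
    return missing, wrong
```

Checking the record locally before it reaches the synthesis model is cheap, and it catches the common failure where a number comes back as a string such as "1.2".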

Composing deep research yourself

This is the most important architectural decision in the post, so we state it plainly.

Why there is no managed /v1/deep-research or /v1/agent

fastCRW deliberately does not ship a managed /v1/deep-research endpoint or a /v1/agent (Spark-style) endpoint. If you want a single API call that takes a question and returns a multi-source research report, fastCRW does not have one, and that is a genuine gap relative to vendors that do. What fastCRW gives you instead are the three composable primitives above — search, scrape, extract — plus crawl (/v1/crawl) for walking an entire IR or filings section. You wire the loop yourself.

Build-vs-buy for a research loop

For a finance agent this trade-off usually favors building, for reasons specific to the domain. A managed research endpoint is a black box: you cannot see which sources it consulted, you cannot weight a filing above a forum post, you cannot enforce that the answer cites a primary document, and you cannot audit it after the fact. In regulated investment work those are not nice-to-haves — they are the requirements. Composing the loop on primitives means every step is inspectable: you log the exact URLs searched, the exact markdown scraped, and the exact fields extracted, so a thesis is reproducible and defensible. The cost is engineering effort; the benefit is a glass box instead of a black one. We walk through one such build in the deep research agent guide, and survey the managed alternatives in the best deep research APIs roundup so you can make the call with both options in view.
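One way to make the loop inspectable is to write an audit entry at every step. The sketch below is a hypothetical orchestration, not fastCRW code: `search`, `scrape`, and `extract` are callables you supply (for example thin wrappers around the three endpoints), and the log format is an assumption.

```python
import hashlib
import time


def research_turn(question, search, scrape, extract, max_sources=5):
    """Run one search-scrape-extract pass and return (records, audit_log)."""
    audit = []

    def log(step, **detail):
        audit.append({"ts": time.time(), "step": step, **detail})

    urls = list(search(question))[:max_sources]
    log("search", query=question, urls=urls)

    records = []
    for url in urls:
        markdown = scrape(url) or ""
        # Hash the scraped text so the stored copy can be verified later.
        digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
        log("scrape", url=url, chars=len(markdown), sha256=digest)
        if not markdown:
            continue  # the miss is logged; no content is invented
        fields = extract(url)
        log("extract", url=url, fields=fields)
        records.append({"url": url, "markdown": markdown, "fields": fields})
    return records, audit
```

Because the three steps are injected, the same function runs against live endpoints in production and against stubs in tests, and the audit list can be persisted next to the thesis it supports.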

Keeping the synthesis LLM in your own loop

fastCRW's managed extraction and managed search answer-mode run a managed LLM on paid plans, and that managed step is metered in credits with a hard per-request cap, so it never becomes an unbounded cost. But the reasoning model that actually writes your investment thesis does not have to be the engine's: for investment work most teams hand the extracted, clean context to a model of their own choosing in their own agent loop. The point is that the reasoning layer can stay yours — you pick the synthesis model, you run it where you want, and you can change it without re-platforming your agent.
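Keeping the reasoning layer swappable can be as simple as treating the model as a plain callable. This is a sketch under that assumption: `build_prompt` and `synthesize` are hypothetical helpers, and the record shape (`url`, `markdown`) is illustrative.

```python
def build_prompt(question, records):
    """Assemble a prompt that numbers each source so the answer can cite it."""
    lines = [
        f"Question: {question}",
        "",
        "Answer using only the sources below and cite them as [n].",
        "",
    ]
    for n, record in enumerate(records, start=1):
        lines.append(f"[{n}] {record['url']}")
        lines.append(record["markdown"])
        lines.append("")
    return "\n".join(lines)


def synthesize(question, records, model):
    """`model` is any callable that maps a prompt string to an answer string."""
    return model(build_prompt(question, records))
```

Swapping the synthesis model then means passing a different callable; nothing in the retrieval code has to change.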

Latency in the agent loop

An agent that makes several tool calls per question accumulates latency, so the per-call numbers compound. Here is the honest picture.

Search latency and tool-call budgets

Search is the cheap hop: an 880 ms average over the 100-query benchmark (triple-bench.ts) means a discovery step that rarely dominates a multi-second reasoning turn. If your agent issues two or three searches to triangulate sources, budget a couple of seconds for discovery and spend the rest on scraping and synthesis.
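The budgeting arithmetic is simple enough to encode. The sketch below uses the 880 ms average from the benchmark cited above; the function name and the 10-second turn budget in the usage note are hypothetical, and an average says nothing about tail latency.

```python
SEARCH_AVG_MS = 880  # average from the 100-query benchmark cited in the post


def remaining_budget_ms(turn_budget_ms, searches, search_avg_ms=SEARCH_AVG_MS):
    """Expected time left for scraping and synthesis after serial searches."""
    discovery_ms = searches * search_avg_ms
    return max(turn_budget_ms - discovery_ms, 0)
```

With three serial searches and a 10,000 ms turn budget, the expected discovery cost is 3 × 880 = 2,640 ms, which leaves 7,360 ms for everything else.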

Median scrape latency vs the honest tail

fastCRW's median scrape latency was 1914 ms, beating Firecrawl's 2305 ms (diagnose_3way.py, 2026-05-08) — the typical filing or IR page comes back in under two seconds. In fast mode, fastCRW's p90 is 4348 ms — the lowest of the three (Crawl4AI 4754 ms, Firecrawl 6937 ms). The chrome-stealth fallback that recovers the hard pages other tools drop — the same mechanism behind fastCRW's 34 exclusively recovered URLs — adds latency on those specific pages, but those are also the pages your competitors' agents miss entirely. For an agent loop: scrape sources concurrently rather than serially so any slow page does not block the others, and check the full p50/p90 split at /benchmarks.
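Concurrent scraping can be sketched with a thread pool from the standard library. The `scrape` argument is a callable you supply, assumed here to take a URL and return markdown; the function name and the worker count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, scrape, max_workers=8):
    """Scrape URLs concurrently; return (results_by_url, failures_by_url)."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape, url): url for url in urls}
        # Handle pages in completion order so a slow one does not hold up
        # the processing of the fast ones.
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                failures[url] = repr(exc)  # keep the failure for the audit trail
    return results, failures
```

Total wall time is then roughly that of the slowest page rather than the sum of all pages. The sketch sets no deadline of its own, so give the `scrape` callable a network timeout.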

Cost and privacy for sensitive research

Investment research is both cost-sensitive at volume and frequently confidential, and the engine's shape addresses both.

Predictable per-query and per-scrape credits

The credit model is flat and forecastable: 1 credit per search query, 1 credit per scraped page (the same flat 1 for any renderer: http, lightpanda, or chrome), 1 credit per crawled page, and 5 credits per JSON-schema extraction. A research turn that searches twice, scrapes five sources, and extracts structured fields from three of them is roughly 2 + 5 + 15 = 22 credits, and you can compute that in advance rather than discovering it on the invoice. There is no separate extraction subscription layered on top. Live tiers and credit grants are on the pricing page; the free plan includes 500 one-time lifetime credits, enough to prototype the loop end to end.
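The per-turn arithmetic can be captured in a small estimator. The credit prices are the ones quoted above; the function and dictionary names are hypothetical.

```python
# Credit prices as quoted in the post.
CREDITS = {"search": 1, "scrape": 1, "crawl_page": 1, "extract": 5}


def turn_cost(searches=0, scrapes=0, crawled_pages=0, extractions=0):
    """Estimate the credits one research turn will consume."""
    return (
        searches * CREDITS["search"]
        + scrapes * CREDITS["scrape"]
        + crawled_pages * CREDITS["crawl_page"]
        + extractions * CREDITS["extract"]
    )


# The worked example from the text: 2 + 5 + 15 = 22 credits.
example = turn_cost(searches=2, scrapes=5, extractions=3)
```

Multiplying the per-turn figure by expected turns per day gives a forecast you can compare against a plan's credit grant before committing.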

Self-host so research targets stay private

The fastCRW engine is a single static Rust binary (~8 MB image, one container) under AGPL-3.0, so you can self-host it for $0 in software cost — you pay only for your own server. For a hedge fund or research desk, that is the difference between a thesis-in-progress that lives on someone else's cloud and one where the watch-list, the tickers you are quietly researching, and the scraped filing content never leave your infrastructure. The list of names you are studying is itself signal; self-hosting keeps that signal in-house. The trade-off is honest: self-hosting means you operate the binary and you do not get the managed cloud's conveniences.
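If privacy is the reason for self-hosting, the client should fail closed rather than fall back to a remote host. The sketch below shows one way to do that; the `CRW_BASE_URL` variable name is hypothetical, and the address in the usage note is a placeholder.

```python
import os


def resolve_base_url(env=None):
    """Return the self-hosted engine URL, or raise if none is configured."""
    env = os.environ if env is None else env
    # CRW_BASE_URL is a hypothetical variable name chosen for this sketch.
    url = env.get("CRW_BASE_URL", "").strip().rstrip("/")
    if not url:
        raise RuntimeError(
            "CRW_BASE_URL is not set; refusing to fall back to a remote host"
        )
    return url
```

For example, `resolve_base_url({"CRW_BASE_URL": "http://10.0.0.5:3000/"})` returns the URL without the trailing slash, ready to be joined with `/v1/scrape`.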

Honest gaps for finance workloads

Here is where fastCRW is not the right fit, stated plainly so you can plan around it.

No managed research or agent endpoints

To restate it once more because it is the crux: there is no /v1/deep-research and no /v1/agent. If your team's hard requirement is a single managed call that returns a finished report and you do not want to own the loop, a vendor with those endpoints genuinely wins for you, and you should use one. fastCRW's bet is that finance teams want the glass box.

Anti-bot on hardened data portals

fastCRW has no Fire-engine-style anti-bot infrastructure, the engine is stateless per request, and robots.txt is respected by default. Many regulator and exchange filing systems are open and scrape cleanly, but some commercial data portals are heavily hardened against automation; against those, expect to supply your own access path, a licensed data feed, or an API. There is also no screenshot output (a formats: ["screenshot"] request returns HTTP 422) and no multi-URL batch extract — for many sources you iterate /v1/scrape concurrently or crawl. None of these block the search-scrape-synthesize loop for the common case of public filings, IR pages, and news; they bound it at the edges.
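A client-side guard can catch the unsupported format before it costs a round trip. The 422 behavior for `screenshot` is from the text; the function name and the choice to raise `ValueError` are assumptions of this sketch.

```python
# Per the post, requesting this format returns HTTP 422.
UNSUPPORTED_FORMATS = {"screenshot"}


def check_formats(formats):
    """Reject formats the engine is documented not to support."""
    bad = sorted(set(formats) & UNSUPPORTED_FORMATS)
    if bad:
        raise ValueError(f"unsupported formats: {bad}")
    return list(formats)
```

Since there is no multi-URL batch extract either, call this once per URL inside whatever concurrent scrape loop you run.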

Sources

  • fastCRW canonical fact sheet (internal): scrape benchmark (diagnose_3way.py, 819 labeled URLs, 2026-05-08), search benchmark (triple-bench.ts, 100 queries), API surface, honest gaps.
  • fastCRW facts and full p50/p90/p99 distribution: fastcrw.com/benchmarks.
  • fastCRW repo and managed cloud: github.com/us/crw · fastcrw.com.
  • Firecrawl public scrape-content dataset (the labeled set used above): docs.firecrawl.dev.

Related: Deep research agent with CRW · Best deep research APIs · Search API for AI agents · Structured extraction with a JSON schema

Frequently asked questions

How do I build a financial research agent with live web data?

Compose three primitives in a loop: POST /v1/search (1 credit/query) to discover today's filings, IR pages, and news; POST /v1/scrape (1 credit/page) to convert each URL into clean LLM-ready markdown; and a /v1/scrape call with formats: ["json"] plus a jsonSchema (5 credits, a managed feature on paid plans) to extract typed fields like revenue, EPS, and guidance. Hand the extracted context to a synthesis model in your own agent loop. fastCRW is Firecrawl-compatible, so the Firecrawl SDK works after a base-URL swap.

Does fastCRW have a managed deep-research or agent endpoint?

No. fastCRW does not ship /v1/deep-research or /v1/agent (Spark-style) endpoints, which is a genuine gap versus vendors that offer a single managed research call. Instead it exposes composable primitives (search, scrape, crawl, extract) so you build and audit the research loop yourself. For finance work that is often preferable because every source, scrape, and extracted field is inspectable, but if you need a managed black-box report endpoint, a vendor with one wins for that requirement.

How fast is fastCRW search for an agent loop?

fastCRW search averaged 880 ms over a 100-query benchmark and took 73 of 100 latency wins against Firecrawl and Tavily (triple-bench.ts). Scraping: median latency was 1914 ms, beating Firecrawl's 2305 ms. In fast mode, p90 is 4348 ms, the lowest of the three tested. For the minority of URLs requiring chrome-stealth recovery (the same path that gives fastCRW 34 exclusive recoveries no competitor reaches), latency rises on those specific pages. Scrape sources concurrently rather than serially to hide per-page variance.

Can I keep my research LLM and targets private?

Yes. fastCRW's managed extraction and answer synthesis run a managed LLM on paid plans, but you can hand the clean extracted context to a synthesis model of your own choosing in your own agent loop, so the reasoning model stays one you choose and govern. The engine is a single ~8 MB static Rust binary under AGPL-3.0 that you can self-host for $0 in software cost, so your watch-list, the tickers you are researching, and scraped filing content never leave your infrastructure.

Is it cheaper to build a research loop than buy an investment research API?

It depends on volume and requirements, but the per-operation economics are flat and forecastable: 1 credit per search query, 1 per scraped page (flat 1 for any renderer, with no chrome surcharge), 1 per crawled page, and 5 per JSON-schema extraction, with no separate extraction subscription. A research turn of two searches, five scrapes, and three extractions is about 22 credits. Self-hosting the AGPL-3.0 engine drops software cost to $0 (you pay only your server). A managed research API may still be worth it if you value a single call over owning and auditing the loop.

Get Started

Try CRW Free

Self-host for free (AGPL) or use fastCRW cloud with 500 free credits — no credit card required.
