By the fastCRW team · ArXivQA run completed and scored 2026-06-20 on the deployed endpoints · Verify independently.

The short version

Firecrawl shipped a Research Index: a purpose-built API for research agents — search papers, inspect metadata, read passages, expand the citation graph. On the public ArXivQA paper-retrieval benchmark, their Research Index scores 53.3% recall, the highest of any provider on the board.

We built a drop-in compatible Research API — same endpoint paths, same request/response shapes, so the Firecrawl research SDK works unchanged against api.fastcrw.com — and scored it on the same 191 questions. fastCRW reaches 61.0% recall: +7.7 points ahead of Firecrawl, and clear of Claude (45.4%), Parallel (44.3%), and Exa (43.4%).

The interesting part isn't the number. It's that we got there with no self-hosted paper index at all — every query is answered live. Here's how.

Disclosure: we build fastCRW, a Firecrawl-compatible, open-core (AGPL-3.0) scrape/crawl/map/search engine. This post is about a new endpoint family we shipped. We're honest below about what's live, what's a thin wrapper, and where the recall actually comes from.

The benchmark

ArXivQA is 191 natural-language research questions, each with a set of ground-truth arXiv papers that answer it. You're scored on recall — the fraction of each question's correct papers you surface — averaged over all 191. Extra papers don't hurt, so the whole game is coverage.
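The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the official harness: the function names, the dictionary layout, and the toy arXiv ids are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set


def question_recall(retrieved: Iterable[str], truth: Set[str]) -> float:
    """Fraction of one question's ground-truth papers that were surfaced."""
    if not truth:
        raise ValueError("each question needs at least one ground-truth paper")
    return len(truth & set(retrieved)) / len(truth)


def benchmark_recall(runs: Dict[str, Iterable[str]], gold: Dict[str, Set[str]]) -> float:
    """Mean of per-question recall over every question in the gold set."""
    scores = [question_recall(runs.get(qid, []), truth) for qid, truth in gold.items()]
    return sum(scores) / len(scores)


# Hypothetical toy data: two questions, arXiv ids as plain strings.
gold = {"q1": {"2401.00001", "2401.00002"}, "q2": {"2402.00003"}}
runs = {
    "q1": ["2401.00001", "9999.99999"],  # one of two correct papers, plus an extra
    "q2": ["2402.00003", "2401.00002"],  # the correct paper, plus an extra
}
score = benchmark_recall(runs, gold)  # q1 = 0.5, q2 = 1.0, mean = 0.75
```

Note that the extra papers in each run leave the score unchanged, which is why coverage is the only thing that matters under this metric.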

The questions are deliberately hard. Many ask for a family of papers ("papers proposing overlong reward shaping", "LoRA variants that insert a new matrix between two existing ones"). Some ask "what does paper X compare against" — where the answer lives inside X's own bibliography. Some ask "which open model is best on benchmark Y" — where you have to read a leaderboard. A title-and-abstract search alone misses most of them.

Provider	ArXivQA recall
fastCRW Research API	61.0%
Firecrawl Research Index	53.3%
Claude	45.4%
Parallel	44.3%
Exa	43.4%

Figure: ArXivQA recall by provider (fastCRW 61.0%, Firecrawl 53.3%, Claude 45.4%, Parallel 44.3%, Exa 43.4%), run on each provider's live deployed endpoint.

No index. Live retrieval.

The obvious way to win a paper-retrieval benchmark is to build a big semantic index over every arXiv paper and search it. We didn't. fastCRW's Research API answers each query live, by merging three retrieval sources at request time:

Our own search — self-hosted SearXNG over web (Google/Bing) plus a research mode that routes to arXiv, Crossref, and scholarly engines. This is the primary recall driver. The agent rewrites each question into 8–12 exact-name queries — specific method, model, dataset, and benchmark names — and those specific queries surface the niche papers a single broad query never ranks.
Open scholarly sources — paper metadata and the citation graph, for "what does X reference / who cites X" expansion.
Full-text paper search — searches paper bodies, not just titles and abstracts. This turned out to matter most: in a source-by-source breakdown it was the single highest-recall source, and it contributed roughly 19% of ground-truth papers that nothing else found — the papers that mention the topic deep in the text but never in the title or abstract.

That last point is the whole reason title-search retrievers cap out. The answer to "papers that do X" is often a paper whose abstract never says "X" — but whose method section does. Searching the body recovers it.
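A request-time merge of several sources might look roughly like the sketch below. It is an illustration under stated assumptions, not fastCRW's implementation: the `Paper` shape, the `found_by` field, the id normalization rule, and the two stub sources are all hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, List

Paper = Dict[str, str]  # hypothetical shape: {"id": ..., "title": ...}
Source = Callable[[str], List[Paper]]


def normalize_id(raw: str) -> str:
    """Strip an optional 'arxiv:' prefix and a trailing version such as 'v5'."""
    bare = raw.strip().lower()
    if bare.startswith("arxiv:"):
        bare = bare[len("arxiv:"):]
    return re.sub(r"v\d+$", "", bare)


def merge_live(queries: Iterable[str], sources: Dict[str, Source]) -> List[Paper]:
    """Run every query against every source; keep the first hit per paper."""
    merged: Dict[str, Paper] = {}
    for query in queries:
        for name, search in sources.items():
            for paper in search(query):
                key = normalize_id(paper["id"])
                if key not in merged:
                    merged[key] = {**paper, "id": key, "found_by": name}
    return list(merged.values())


# Stub sources standing in for live calls (assumed, for illustration only).
def web_search(query: str) -> List[Paper]:
    return [{"id": "arXiv:1706.03762v5", "title": "Attention Is All You Need"}]


def full_text_search(query: str) -> List[Paper]:
    return [
        {"id": "1706.03762", "title": "Attention Is All You Need"},
        {"id": "0000.00001", "title": "placeholder paper found only in body text"},
    ]


papers = merge_live(
    ["exact-name query one", "exact-name query two"],
    {"web": web_search, "full_text": full_text_search},
)
```

Deduplicating on a normalized id is what lets the same paper arrive from two sources, or from two query rewrites, without being counted twice.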

The endpoints are dumb. The skill is smart.

Firecrawl's Research Index is split into stateless endpoints (search-papers, inspect-paper, read-paper, related-papers) plus a research skill that orchestrates them. We mirror that split exactly.

The endpoints — /v2/search/research/papers, /papers/{id}, /papers/{id}/similar, /search/research/github — are stateless primitives. They just retrieve and rank. The intelligence lives in the research skill that drives them:

Intent routing. "Compare-against" questions → pull the seed paper's own references: the answer is in its bibliography, and citation APIs lag months behind on new papers while the PDF does not. "Best on benchmark Y" → read the leaderboard, extract the open model names, and search for each one's technical report. "Enumerate papers that do X" → use exact-name reframings and, when a tight survey exists, harvest its list.
Exact-name decomposition. One broad query is weak; 8–12 specific-name queries are strong. This is the single biggest recall lever.
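To make the routing idea concrete, here is a toy router. In the real skill the agent itself presumably decides the intent; the keyword rules, the `Intent` names, and the plan strings below are hypothetical stand-ins that only mirror the three question types described above.

```python
import re
from enum import Enum
from typing import List


class Intent(Enum):
    COMPARE_AGAINST = "compare_against"  # answer lives in the seed paper's bibliography
    LEADERBOARD = "leaderboard"          # read a leaderboard, then search each model
    ENUMERATE = "enumerate"              # family-of-papers questions (the default)


# Hypothetical keyword cues, checked in order; the first match wins.
_RULES = [
    (Intent.COMPARE_AGAINST,
     re.compile(r"\bcompar(e|es|ed|ing)\s+(against|to|with)\b|\bbaselines?\b", re.I)),
    (Intent.LEADERBOARD,
     re.compile(r"\b(best|top)\b.*\b(benchmark|leaderboard)\b", re.I)),
]

_PLANS = {
    Intent.COMPARE_AGAINST: ["locate the seed paper", "pull the seed paper's references"],
    Intent.LEADERBOARD: ["read the leaderboard", "extract open model names",
                         "search each model's technical report"],
    Intent.ENUMERATE: ["rewrite into exact-name queries",
                       "harvest a tight survey's list if one exists"],
}


def route(question: str) -> Intent:
    """Classify a question; anything unmatched is treated as enumeration."""
    for intent, pattern in _RULES:
        if pattern.search(question):
            return intent
    return Intent.ENUMERATE


def plan(question: str) -> List[str]:
    """Return the ordered retrieval steps for a question."""
    return _PLANS[route(question)]
```

For example, `plan("Which open model is best on benchmark Y?")` yields the three leaderboard steps, while a question such as "papers proposing overlong reward shaping" falls through to the enumeration plan.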

This is why the headline number measures the skill driving the endpoints, not the endpoints called blind, which is the same way Firecrawl's number is measured. Drop-in compatibility means the skill, the SDK, and the CLI all work the same against either provider.

Drop-in compatible

The response shapes match Firecrawl's research SDK field-for-field — paperId, primaryId, prefix-less ids, the two distinct paper shapes for search vs inspect. Point the Firecrawl research SDK at https://api.fastcrw.com and it works:

curl -s -H "Authorization: Bearer $FASTCRW_API_KEY" \
  "https://api.fastcrw.com/v2/search/research/papers?query=diffusion%20image%20synthesis&k=20"

Or the fastCRW SDK:

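from crw import CrwClient

# Authenticate with your fastCRW API key.
c = CrwClient(api_key="crw_live_…")

# Search papers with a free-text query.
c.search_papers("diffusion image synthesis", k=20)

# Expand the citation graph from a seed paper, here following its references.
c.related_papers("arxiv:1706.03762", intent="efficient transformers", mode="references")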
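For callers using neither SDK, the curl request shown earlier can be reproduced with the Python standard library alone. This is a sketch: the helper name `build_search_request` and the placeholder key are assumptions, and only the path, the `query` and `k` parameters, and the bearer header come from the curl example.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.fastcrw.com"


def build_search_request(query: str, k: int, api_key: str) -> Request:
    """Build (but do not send) a GET request for the paper-search endpoint."""
    # quote_via=quote encodes spaces as %20, matching the curl example.
    params = urlencode({"query": query, "k": k}, quote_via=quote)
    url = f"{BASE_URL}/v2/search/research/papers?{params}"
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})


req = build_search_request("diffusion image synthesis", 20, "YOUR_API_KEY")
# With a real key in place, send it via urllib.request.urlopen(req).
```

Building the request separately from sending it keeps the URL encoding easy to inspect before any network call is made.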

Honest limits

It's live, so it's seconds, not milliseconds. A hot in-memory index answers in tens of milliseconds; our multi-source live merge takes seconds, and the slowest leg is an upstream source's shared rate limit. We trade a little latency for current coverage and zero index maintenance. For an agent doing deep research, seconds-per-call is fine; for autocomplete, it isn't.
Read-passages are abstract-scoped today. Full arXiv-body passage retrieval is on the roadmap.
The number is the agent + skill, run live against the deployed endpoints — all 191 questions, ground truth hidden. It's the product, not a private harness.
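The latency point above, that a live merge is only as fast as its slowest upstream leg, can be illustrated with a small simulation. Everything here is hypothetical: the source names, the delays, and the helper functions are stand-ins, and the sketch assumes the sources are queried in parallel.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Source = Callable[[str], List[str]]


def make_source(name: str, delay_s: float) -> Source:
    """Hypothetical stand-in for an upstream call that takes delay_s seconds."""
    def search(query: str) -> List[str]:
        time.sleep(delay_s)
        return [f"{name}:{query}"]
    return search


def timed_fan_out(query: str, sources: Dict[str, Source]) -> Tuple[Dict[str, List[str]], float]:
    """Query all sources in parallel; return results and wall-clock seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    return results, time.perf_counter() - start


sources = {
    "web": make_source("web", 0.05),
    "scholarly": make_source("scholarly", 0.10),
    "full_text": make_source("full_text", 0.20),  # the slowest leg
}
results, elapsed = timed_fan_out("reward shaping", sources)
```

In this simulation the elapsed time tracks the 0.20 s source rather than the 0.35 s sum, so speeding up the fast legs would not help; only the slowest upstream matters.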

Try it

Install the research skill into your agent (Claude Code, Cursor, Codex, Gemini CLI) with one command:

npx skills add us/crw@crw-research

The Research API docs cover every endpoint, and the ArXivQA benchmark page is the canonical reference for the 61.0%. If you already use Firecrawl's research SDK, the only change is the base URL.

We Built a Drop-In Firecrawl Research API — and Beat It on ArXivQA (61% vs 53.3%)
