Benchmarks/Benchmark / ArXivQA · research recall

Results from the ArXivQA Research-Retrieval Benchmark

fastCRW's Research API vs Firecrawl's Research Index on the alphaXiv ArXivQA benchmark — 191 paper-retrieval queries scored on recall. fastCRW reaches 61.0% vs Firecrawl 53.3%.

Published

June 20, 2026

Updated

June 20, 2026

What this benchmark measures

ArXivQA is alphaXiv's public paper-retrieval benchmark: 191 natural-language research questions, each with a set of ground-truth arXiv papers that answer it. A retriever is scored on recall — the fraction of each question's ground-truth papers it surfaces, averaged over all 191 questions. Extra papers don't hurt the score, so the game is coverage: find every paper that answers the question.

The questions are hard on purpose. Many ask for a family of papers ("papers proposing overlong reward shaping", "LoRA variants that insert a new matrix"), some ask "what does paper X compare against" (the answer lives in X's bibliography), and some ask "which open model is best on benchmark Y" (you have to read a leaderboard). A title-and-abstract search alone misses most of them.

Results

ArXivQA recall — fastCRW 61.0%, Firecrawl 53.3%, Claude 45.4%, Parallel 44.3%, Exa 43.4% — run on each provider's live deployed endpoint

Provider	Recall
fastCRW Research API	61.0%
Firecrawl Research Index	53.3%
Claude	45.4%
Parallel	44.3%
Exa	43.4%

fastCRW reaches 61.0% — the highest provider on the board, +7.7 points ahead of Firecrawl's Research Index and well clear of the rest. Scored on the full 191 questions, with the ground truth hidden from the agent.

How it works — live, no index

fastCRW's Research API is not a pre-built paper index. Each query is answered live by merging multiple retrieval sources at request time:

Our own search (self-hosted SearXNG) — web (Google/Bing) + a research mode that routes to arXiv, Crossref, and scholarly engines. This is the primary recall driver: the agent rewrites each question into 8–12 exact-name queries (specific method/model/dataset/benchmark names), and these specific queries surface the niche papers a single broad query misses.
Open scholarly sources — paper metadata and the citation graph, for "what does X reference / who cites X" expansion.
Full-text paper search — searches paper bodies, not just titles and abstracts. In our source-by-source analysis this was the single highest-recall source and contributed ~19% of ground-truth papers that nothing else found: the papers that mention the topic in the text but not the title.

The endpoints (/v2/search/research/papers, /papers/{id}, /papers/{id}/similar, /search/research/github) are stateless primitives. The intelligence — intent routing, exact-name reframing, reading a leaderboard, pulling a paper's self-references for "compare-against" questions — lives in the open research skill, exactly the way Firecrawl splits its Research Index endpoints from its research skill.

Honest scope

The score is the agent + skill over the live endpoints, the same shape Firecrawl's number is measured in. It is not the endpoints called blind.
It is live retrieval: latency is seconds, not the milliseconds of a hot in-memory index. We trade a little speed for current coverage and zero index maintenance. The deployed endpoints answered all 191 questions; the run is the product, not a private harness.
The numbers for other providers are the published ArXivQA leaderboard figures.

Reproduce it

The endpoints are a drop-in target for the Firecrawl research SDK — point it at https://api.fastcrw.com and the /v2/search/research/* surface matches. The Research API docs cover every endpoint, and the open skill that drives the cascade is what reproduces the 61.0%.

Continue exploring

More from Benchmarks

View all benchmarks

Next in Benchmarks

Results from the 1,000-URL Firecrawl Dataset Benchmark

Benchmarks

Search Benchmark: fastCRW vs Tavily vs Firecrawl

100-query concurrent search benchmark comparing fastCRW, Tavily, and Firecrawl on latency, win rate, and reliability across 10 query categories.

search api benchmark880ms average latency vs Tavily's 2,000ms

Benchmarks

fastCRW Benchmark Methodology

How fastCRW frames internal and third-party benchmark claims, including metric definitions, source provenance, and interpretation rules.

web scraping benchmark methodologyInternal and third-party results are labeled separately

Related hubs

Keep the crawl path moving

Alternatives

Apply benchmark context to real tool-selection decisions.

Use Cases

Translate measurement data into concrete scraping workloads.

Docs

Use the product docs once you know what you want to test.