By the fastCRW team · Benchmark figures verified 2026-05-18 · Self-host pricing is $0 (AGPL-3.0); managed launch pricing expires 2026-06-01 · Verify independently before relying on numbers.
Review sentiment analysis with web scraping starts with clean extraction
Review sentiment analysis with web scraping is a two-part problem, and almost everyone obsesses over the wrong part. The model and the prompt get all the attention; the extraction step that feeds them gets treated as plumbing. But a sentiment score is downstream of the text you scored — if a review was silently dropped before it reached the model, no amount of prompt tuning recovers it, and if the review body arrived wrapped in nav chrome, cookie banners, and "helpful?" widgets, the model is scoring noise. Garbage in, garbage out applies to review mining with unusual force, because the signal you care about is a few sentences buried in a page built for humans, not analysts.
So this guide treats the scrape-to-sentiment loop as one pipeline: collect reviews from review sites and listing pages, score each review with aspect-based structured output, then aggregate and trend the result over time. The brand angle is simple and we will be honest about it (we build fastCRW): your sentiment numbers are only as trustworthy as the review text underneath them, and fastCRW posts the highest truth-recall of the three tools tested — 63.74% on Firecrawl's public 819-URL labeled dataset (diagnose_3way.py, 2026-05-08) versus Crawl4AI's 59.95% and Firecrawl's 56.04% on the same set — so fewer reviews are silently missing before you score them.
Sentiment quality is downstream of text quality
Recall is the metric that matters most here. A review platform with 4,000 reviews is a sample; if your scraper captures 3,200 of them and quietly fails on the rest, your "average sentiment" is computed on a biased subset (failures are rarely random — they cluster on the longest, most detailed, often most negative reviews that trip up parsers). fastCRW's 91.8% scrape-success (of reachable URLs) with 0 thrown errors on the 1,000-URL run (diagnose_3way.py, 2026-05-08) is the paired number to anchor on: you want both a high success rate and a guarantee that failures surface as visible errors rather than silent empties you never reconcile.
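A minimal sketch of that reconciliation step, assuming you already hold the list of review URLs you expected to capture. The `ScrapeResult` type, the `reconcile` helper, and the 40-character threshold are hypothetical names and values chosen for illustration, not part of fastCRW.

```python
from dataclasses import dataclass


@dataclass
class ScrapeResult:
    url: str
    ok: bool        # did the request itself succeed?
    markdown: str   # extracted text, possibly empty


def reconcile(expected_urls, results, min_chars=40):
    """Split a run into usable pages, silent empties, errors, and missing URLs."""
    by_url = {r.url: r for r in results}
    report = {"usable": [], "silent_empty": [], "error": [], "missing": []}
    for url in expected_urls:
        r = by_url.get(url)
        if r is None:
            report["missing"].append(url)        # never attempted or never returned
        elif not r.ok:
            report["error"].append(url)          # visible failure: easy to retry
        elif len(r.markdown.strip()) < min_chars:
            report["silent_empty"].append(url)   # "succeeded" but carries no review text
        else:
            report["usable"].append(url)
    return report
```

Anything outside `usable` is a review that would otherwise vanish from the average, so it is worth logging those three buckets on every run.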
Markdown vs raw HTML for LLM scoring
Feed an LLM raw review-page HTML and you pay for thousands of tokens of markup, inflate latency, and bury the actual review text in DOM noise the model has to learn to ignore. fastCRW returns clean, LLM-ready markdown by default — the review body, rating, and date as readable text, not <div> soup. That is the right input for sentiment scoring: cheaper tokens, lower variance, and a prompt that sees the review rather than the page. See LLM-ready markdown extraction for why markdown beats HTML as a model input.
Collecting reviews from review sites and listings
Review collection has three sub-problems: finding the pages that hold reviews, walking paginated review sets, and coping with reviews that only appear after JavaScript runs.
Discovering review and product listing pages
Start with /v1/map to enumerate a site's URLs, then filter to the product or listing pages that carry reviews. For a known set of products you can skip discovery and crawl directly. The endpoint surface is small and Firecrawl-compatible — /v1/map, /v1/crawl, and /v1/scrape — so existing Firecrawl pipelines port over after a base-URL swap.
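The discovery step might look like the sketch below. It assumes a Firecrawl-style request body (`{"url": ...}`), a `links` array in the response, and bearer-token auth. None of those details are confirmed by this article, so check them against the API reference. `map_site`, `filter_review_pages`, and the URL patterns are hypothetical.

```python
import json
import re
import urllib.request


def map_site(base_url, site_url, api_key):
    """POST to /v1/map and return the discovered URLs (assumed response shape)."""
    req = urllib.request.Request(
        f"{base_url}/v1/map",
        data=json.dumps({"url": site_url}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumption: bearer auth
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body.get("links", [])  # assumption: Firecrawl-style "links" array


def filter_review_pages(urls, patterns=(r"/products?/", r"/reviews?(/|$|\?)")):
    """Keep only URLs whose path looks like a product or review page."""
    compiled = [re.compile(p) for p in patterns]
    return [u for u in urls if any(c.search(u) for c in compiled)]
```

The patterns are site-specific; expect to tune them per target rather than reuse one set everywhere.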
Crawling paginated review sets
Most review sections paginate: ?page=2, "load more," or numbered links. Point /v1/crawl at the first review page and let its BFS walk follow pagination links, bounded by maxDepth (cap 10) and maxPages (cap 1000). A crawl returns each page as clean markdown, billed at 1 credit per page. For infinite-scroll review widgets that never expose a "next" URL, you fall back to scraping the rendered page and extracting whatever reviews loaded. That is a real limit, not a solved problem.
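As a sketch, a crawl request body can be built with the two bounds clamped to the caps quoted above. The `maxDepth` and `maxPages` field names come from this article. The `url` field and the `build_crawl_payload` helper are assumptions made for illustration.

```python
MAX_DEPTH_CAP = 10     # cap stated in the text
MAX_PAGES_CAP = 1000   # cap stated in the text


def build_crawl_payload(first_review_url, max_depth=3, max_pages=200):
    """Build a /v1/crawl body, clamping both limits to the documented caps."""
    if max_depth < 1 or max_pages < 1:
        raise ValueError("max_depth and max_pages must be positive")
    return {
        "url": first_review_url,                    # assumption: Firecrawl-style field
        "maxDepth": min(max_depth, MAX_DEPTH_CAP),
        "maxPages": min(max_pages, MAX_PAGES_CAP),
    }
```

Since each crawled page costs 1 credit, `maxPages` doubles as a spend ceiling for the run.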
Handling JS-rendered review widgets
Plenty of review content is injected by a third-party widget (a star-rating SaaS embedded via script). fastCRW's renderer is auto by default and escalates http → lightpanda → chrome, so JS-rendered reviews are captured without you hand-picking an engine, at the same flat 1 credit per page regardless of which renderer runs. In fast mode, fastCRW's p90 is 4,348 ms, the lowest of the three tools tested (Crawl4AI 4,754 ms, Firecrawl 6,937 ms). The chrome-stealth fallback that recovers widget-rendered reviews extends the tail when those hard pages are hit; for an offline review-mining batch you typically care more about recall than tail latency. See /benchmarks for the full split.
Scoring review sentiment with an LLM
Once reviews are clean markdown, scoring is a structured-extraction call, not free-text generation.
Per-review structured sentiment with a JSON schema
Use formats: ["json"] with a jsonSchema on /v1/scrape to get a typed object back per review instead of prose you have to re-parse. A minimal schema captures the polarity and a confidence, e.g. { sentiment: "positive" | "neutral" | "negative", score: number, rationale: string }. Structured output is what makes the result aggregable. Requests with formats: ["json"] cost 5 credits — that is the price of the LLM extraction leg, on top of the 1-credit page fetch. The deep dive on schema design lives in structured extraction with a JSON schema.
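Written out as a JSON Schema, the minimal object above could look like the following sketch. The `formats` and `jsonSchema` keys come from this article. The `url` field, the 0 to 1 range on `score`, and the helper names are assumptions.

```python
SENTIMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "score": {"type": "number", "minimum": 0, "maximum": 1},  # assumed range
        "rationale": {"type": "string"},
    },
    "required": ["sentiment", "score", "rationale"],
}


def build_scrape_payload(url, schema=SENTIMENT_SCHEMA):
    """Body for a /v1/scrape call that asks for typed JSON instead of prose."""
    return {"url": url, "formats": ["json"], "jsonSchema": schema}


def is_valid_sentiment(obj):
    """Cheap client-side check of a returned object before it is aggregated."""
    return (
        isinstance(obj, dict)
        and obj.get("sentiment") in ("positive", "neutral", "negative")
        and isinstance(obj.get("score"), (int, float))
        and not isinstance(obj.get("score"), bool)
        and isinstance(obj.get("rationale"), str)
    )
```

Validating on the client as well is a design choice: it keeps one malformed response from polluting an aggregate.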
Aspect-based sentiment fields (price, quality, support)
Overall polarity is the blunt version. Aspect-based sentiment analysis decomposes a review into the dimensions you actually act on — price, quality, shipping, support, reliability — and scores each independently, because a single review routinely says "great product, terrible delivery." Encode the aspects directly in the schema as nested fields ({ aspects: { price: {...}, quality: {...}, support: {...} } }) so the model returns one polarity per aspect plus an optional supporting quote. That is the difference between "sentiment dropped this quarter" and "sentiment dropped because support complaints tripled while product satisfaction held" — the second is something a product team can route to an owner.
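One way to generate that nested schema from a list of aspects is sketched below. The `aspect_schema` helper and the `quote` field name are hypothetical. The `not_mentioned` value is an added assumption so the model is not forced to invent a polarity for an aspect the review never discusses.

```python
def aspect_schema(aspects=("price", "quality", "support")):
    """Return a JSON Schema with one polarity object per aspect."""

    def polarity():
        # Fresh dict per aspect so later per-aspect tweaks do not leak across fields.
        return {
            "type": "object",
            "properties": {
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "neutral", "negative", "not_mentioned"],
                },
                "quote": {"type": "string"},  # optional supporting quote
            },
            "required": ["sentiment"],
        }

    return {
        "type": "object",
        "properties": {
            "aspects": {
                "type": "object",
                "properties": {name: polarity() for name in aspects},
                "required": list(aspects),
            }
        },
        "required": ["aspects"],
    }
```

Keeping the aspect list as a parameter makes it cheap to add a dimension such as shipping or reliability without rewriting the schema by hand.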
Managed LLM extraction on a paid plan
LLM-based JSON extraction in fastCRW is a managed feature available on paid plans — fastCRW runs the model over the review text and your prompt, so there is nothing to operate yourself. Be precise about scope: the formats: ["json"] extraction path and the managed /v1/search answer mode both run a managed LLM and both require a paid plan; the FREE plan has no LLM features. For sentiment runs where reviews may contain customer PII, scope what you send into the extraction prompt — only the review text the schema needs — rather than whole raw pages.
Aggregating and trending review sentiment
Per-review scores are an input, not the deliverable. The deliverable is a trend.
Rolling review sentiment over time
Aggregate per-review aspect scores into rolling windows — net sentiment per week, per product, per aspect — and the signal you ship is the slope, not the snapshot. This is also where aspect decomposition pays off: a flat overall line can hide a quality improvement cancelling a support regression, and only the per-aspect series shows it.
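A minimal sketch of the aggregation, using fixed ISO-week buckets as a simple stand-in for a rolling window. The +1 / 0 / -1 polarity mapping and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

POLARITY = {"positive": 1, "neutral": 0, "negative": -1}  # assumed mapping


def weekly_net_sentiment(rows):
    """rows: iterable of (review_date, aspect, sentiment) tuples.

    Returns {(iso_year, iso_week, aspect): net score in [-1, 1]}.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of polarity, review count]
    for review_date, aspect, sentiment in rows:
        iso = review_date.isocalendar()
        key = (iso[0], iso[1], aspect)
        totals[key][0] += POLARITY[sentiment]
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


def slope(series):
    """Least-squares slope of equally spaced values: the trend, not the snapshot."""
    n = len(series)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

Computing `slope` per aspect rather than on the overall line is what exposes a support regression that a quality improvement would otherwise cancel out.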
Storing snapshots yourself (stateless engine)
fastCRW is stateless per request: it scrapes and scores, but it does not remember yesterday's reviews for you. That is a deliberate design choice, and it means the time series is yours to own: you persist each run's scored reviews to your own store, keyed by review ID and run date, and compute trends and deltas there. Pair a scheduled crawl with idempotent upserts so re-scraped reviews update in place rather than duplicating; the Scheduled crawls on cron guide covers the recurring-run mechanics.
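An idempotent upsert can be sketched with SQLite from the Python standard library. The table layout, column names, and helper functions are hypothetical, and the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store that holds scored reviews per run."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS scored_reviews (
            review_id TEXT NOT NULL,
            run_date  TEXT NOT NULL,
            aspect    TEXT NOT NULL,
            sentiment TEXT NOT NULL,
            PRIMARY KEY (review_id, run_date, aspect)
        )
        """
    )
    return conn


def upsert_scores(conn, rows):
    """rows: iterable of (review_id, run_date, aspect, sentiment) tuples."""
    conn.executemany(
        """
        INSERT INTO scored_reviews (review_id, run_date, aspect, sentiment)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (review_id, run_date, aspect)
        DO UPDATE SET sentiment = excluded.sentiment
        """,
        rows,
    )
    conn.commit()
```

Re-running the same day's job then overwrites rows in place, so a retried or repeated crawl cannot double-count a review.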
Cost and scale of a review-mining run
The economics of review mining are dominated by one decision: do you pay the 5-credit extraction fee per review, or batch reviews before scoring?
Credit math for crawl plus JSON extraction
Two cost components. Crawling the review pages is 1 credit per page regardless of renderer. Extracting structured sentiment is 5 credits per request that uses formats: ["json"]. The lever is how many reviews you pack into a single extraction call: scoring one review per call is 5 credits each; scoring a page of 20 reviews in one schema'd call is 5 credits for the batch. For a 5,000-review corpus, that is the difference between ~25,000 extraction credits and ~1,250 — design your schema to score a page of reviews at once, not one at a time. Note the honest constraint that shapes this: fastCRW has no multi-URL batch extract endpoint, so you batch reviews within a page's schema, and iterate /v1/scrape concurrently or crawl across pages.
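The arithmetic above can be checked with a few lines. The per-page and per-request credit costs are the figures quoted in this article. The function names are hypothetical.

```python
import math

CRAWL_CREDITS_PER_PAGE = 1    # per page, regardless of renderer
JSON_EXTRACTION_CREDITS = 5   # per request that uses formats: ["json"]


def extraction_credits(total_reviews, reviews_per_call):
    """Credits spent on the extraction leg for a given batching factor."""
    calls = math.ceil(total_reviews / reviews_per_call)
    return calls * JSON_EXTRACTION_CREDITS


def crawl_credits(pages):
    """Credits spent fetching the review pages themselves."""
    return pages * CRAWL_CREDITS_PER_PAGE


one_per_call = extraction_credits(5000, 1)    # 25000
page_of_20 = extraction_credits(5000, 20)     # 1250
```

The batching factor is bounded by how many reviews a page actually shows, so a site that lists 10 reviews per page lands at 2,500 extraction credits for the same corpus, not 1,250.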
Self-host vs managed for high volume
At high volume the recurring extraction cost dominates, which is where self-hosting changes the math: the AGPL-3.0 engine is free to run (you pay only your own server), so a nightly re-score of a large review corpus has no per-page cloud fee. Managed fastCRW is the right call when you would rather not operate the engine; self-host is the right call when the corpus is large and the run is recurring. Compare live numbers on /pricing rather than trusting a table that may have moved.
Honest constraints
Two limits will bite a review-mining pipeline, and we would rather you hit them in planning than in production.
No screenshot capture of review widgets (HTTP 422)
fastCRW does not support screenshot output — a request for formats: ["screenshot"] returns HTTP 422. If your workflow needs a visual of a star-rating widget as evidence (for a dispute, an audit, or a compliance archive), fastCRW will not produce it; you score the extracted text, not a pixel snapshot. Plan around the text path.
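A small client-side guard can catch this before a request is sent. This is a sketch only: `check_formats` and the `UNSUPPORTED_FORMATS` set are hypothetical, and the set lists just the one format this article names.

```python
UNSUPPORTED_FORMATS = {"screenshot"}  # per the text, the server answers HTTP 422


def check_formats(formats):
    """Fail fast in the client instead of discovering the 422 mid-batch."""
    bad = sorted(set(formats) & UNSUPPORTED_FORMATS)
    if bad:
        raise ValueError(f"unsupported formats (server would return HTTP 422): {bad}")
    return list(formats)
```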
Anti-bot on hardened review platforms
The large review platforms invest heavily in bot defenses. fastCRW has no Fire-engine-style managed anti-bot layer, so on aggressively hardened sites you will hit blocks the engine does not transparently solve, and login-gated reviews need your own authenticated access. This is a genuine area where a cloud vendor with a dedicated anti-bot stack wins; if your target list is dominated by hardened platforms, weigh that honestly. For sentiment on social posts and brand mentions rather than structured reviews, the Social media monitoring for brand health guide covers the search-first variant of this pipeline.
Sources
- fastCRW canonical fact sheet: scrape benchmark (truth-recall 63.74% of 819 labeled URLs, 91.8% scrape-success of reachable URLs, 0 errors, p50 1914 ms / fast-mode p90 4348 ms), diagnose_3way.py, 2026-05-08; credit costs; endpoint surface; honest gaps.
- fastCRW repo and pricing: github.com/us/crw · fastcrw.com
- Firecrawl public scrape-content-dataset-v1 (1,000 URLs, 819 labeled), used as the shared accuracy benchmark: firecrawl.dev
Related: Structured extraction with a JSON schema · LLM-ready markdown extraction · Scheduled crawls on cron · Social media monitoring for brand health
