Apify vs Spider vs Firecrawl vs Crawlee vs ScrapeGraph: Web Scrapers 2026
Five web-scraping platforms, five fundamentally different shapes of the same problem. A pre-built scraper marketplace, a Rust-cored crawling API, a hosted LLM-first scraping API, a full TypeScript/Python framework, and an AI-extraction library. We pulled every fact from official docs, GitHub, and pricing pages so you can pick by job rather than by hype.

This post vs the sibling Firecrawl post
This is the wider, platform-vs-platform cut: five scraping shapes — pre-built marketplace, Rust API, hosted SaaS, build-it-yourself library, AI extraction. If you want the tighter API-vs-library cut with Playwright as the browser-automation comparator, read Firecrawl vs Anycrawl vs Crawlee vs Playwright (2026). The two posts share the Firecrawl and Crawlee tools but approach the decision from different axes.
TL;DR + 5-branch decision tree
- Need a scraper for a well-known site (Amazon, Twitter, Instagram, Google SERPs)? Search Apify Store first — there are almost 29,000 pre-built actors and yours probably exists. Install Apify Actor MCP to call them from an agent.
- Need a hosted API that returns clean markdown for an LLM? Pick Firecrawl. Mature MCP, Apache 2.0 core, /scrape, /crawl, /map, /search, /extract — covers most agent-facing read jobs.
- Crawling at extreme volume where compute is your biggest cost? Spider’s Rust core and pay-per-GB pricing become attractive at the scale where Node crawlers spend more on EC2 than on payroll.
- Want to own the entire pipeline, run on your own workers, and skip the SaaS bill? Crawlee is the framework. Apache 2.0, Node and Python, 23k+ stars, ships with PlaywrightCrawler, PuppeteerCrawler, CheerioCrawler, request queues, and proxy rotation.
- Want to describe what you want in English instead of writing CSS selectors? ScrapeGraph wraps an LLM around a Playwright pipeline. MIT, 24.9k stars, BYO model (OpenAI, Anthropic, Groq, Gemini, or Ollama for local).
The five are not substitutes. Most production scraping stacks end up using two — typically Firecrawl or Apify for the hosted surface plus Crawlee for the long-tail bulk jobs, or Firecrawl plus ScrapeGraph when LLM extraction is the bottleneck. The expensive decision is which shape fits your workload; once that’s set, the brand inside the shape is a 30-minute test.
Five shapes of scraping in 2026
Web scraping in 2026 looks nothing like it did in 2020. The category split into five distinct shapes, and the right pick depends almost entirely on which shape matches the job — not on which tool has the longer feature list.
1. Pre-built scraper marketplace. Apify Store hosts close to 29,000 ready-made “actors” that scrape specific sites: Amazon listings, Google SERPs, Instagram profiles, Twitter timelines, LinkedIn jobs, every top-100 e-commerce site. You don’t write code — you rent someone else’s already-debugged scraper, supply input, pay per compute unit. The economic shape is “buy, not build,” and it’s the winning play when the long-tail of site-specific quirks would cost you more to solve than the per-run bill.
2. Hosted SaaS scraping API. Firecrawl is the clearest example. You POST a URL, you get back markdown or structured JSON. The vendor owns the browser, the proxies, the anti-bot mitigation, the rendering decisions. You own the key and the bill. This is the default shape for LLM-facing scraping in 2026 because the output is already in the format models expect, and the latency is acceptable for interactive agent loops. Anycrawl, Browserbase, ScrapingBee, and ZenRows live in this same shape.
3. High-performance Rust API. Spider is the most prominent example of a different bet: build the crawler in a systems language so the per-page cost is lower at very high concurrency. The hosted version emphasises raw throughput and the bandwidth-plus-compute pricing model is designed for organisations crawling billions of pages. Adjacent: many engineering teams adopt Rust-cored infrastructure once they hit the “our scrapers cost more than our engineers” threshold.
4. Build-it-yourself library. Crawlee is the canonical framework — also from the Apify team but a fundamentally different product. You install it via npm or pip, write your extraction code, run it on your own workers, and own the cost curve. The framework gives you the hard primitives (request queues, session pools, proxy rotation, browser orchestration) so you spend your time on the site-specific logic, not on the plumbing. The licence is Apache 2.0 and the trade is real: you save SaaS money, you pay in operational time. Scrapy is the Python-only sibling shape; Crawlee’s claim to fame is that its headless-browser story is first-class.
5. AI extraction. ScrapeGraph is the cleanest expression of the LLM-first thesis: don’t write CSS selectors at all, describe what you want in English, let the model figure out where on the page the fields live. The library is Python, MIT-licensed, and model-agnostic (OpenAI, Anthropic, Groq, Gemini, or Ollama for fully-local extraction). Firecrawl’s /extract endpoint is the hosted-API expression of the same idea — and we cover the trade-offs between them in the dedicated section below.
These five shapes overlap at the edges. Apify ships actors that internally use Crawlee. Crawlee can call Firecrawl as one of many fetchers. ScrapeGraph can be wrapped in a Crawlee pipeline. The shapes are mental models for picking, not airtight technical categories. If you’re new to MCP itself, our What is MCP primer covers the protocol the Firecrawl and Apify servers speak.
Side-by-side matrix
Every cell is sourced from the official repo, vendor docs, or pricing page as of May 2026. Star counts and pricing move; always confirm at the source before committing.
| Dimension | Apify | Spider | Firecrawl | Crawlee | ScrapeGraph |
|---|---|---|---|---|---|
| Shape | Marketplace + platform | Rust crawling API | Hosted scraping API | OSS framework | AI extraction library |
| License (product) | Closed (SaaS) | Closed (SaaS) | Apache 2.0 | Apache 2.0 | MIT |
| License (core) | Crawlee (OSS, Apache 2.0) | spider-rs (OSS, MIT) | firecrawl (OSS, Apache 2.0) | Library is the product | Library is the product |
| Host | Apify Cloud | Spider Cloud | Hosted + self-host | Self-host only | Self-host only |
| Language | Node, Python (actors) | Rust (core) | TypeScript | TypeScript, Python | Python |
| Paradigm | Rent or build actors | Call API per URL | Call API per URL | Write your crawler | Describe what you want |
| AI extraction built in | Per-actor | Optional via vision model | Yes — /extract endpoint | No (you wire it) | Yes — core feature |
| MCP support | Yes — apify-actor | Community wrappers | Yes — firecrawl-mcp-server | No (framework, not server) | Community wrappers |
| Free tier | $5 credits/mo | 2,500 sign-up credits | 1,000 credits/mo | Free (you pay infra) | Free (you pay LLM) |
| Mid plan | Scale: $199/mo | Pay-as-you-go credits | Standard: $83/mo | — | — |
| MCP.Directory page | /servers/apify-actor | (no entry) | /servers/firecrawl | (no entry) | (no entry) |
Three observations. First, only Firecrawl and Apify have first-party MCP servers on this directory — for the others, agent integration goes through community wrappers or direct HTTP. Second, the open-vs-hosted axis is more nuanced than “OSS good, SaaS bad”; Crawlee and ScrapeGraph are libraries (you operate them), Firecrawl gives you both, and Apify and Spider are SaaS-first with OSS components. Third, AI extraction is now table stakes — four of five ship it; the question is whether you want it bundled with the scraper (Firecrawl, ScrapeGraph) or you pick the model separately (Apify per-actor, Crawlee plus your own code).
Apify — install + recipe
What it does best
Apify wins when the answer to “which site are you scraping?” is on the well-known list — Amazon products, Google SERPs, Instagram profiles, LinkedIn jobs, Booking.com listings, Walmart catalogs. Apify Store has close to 29,000 pre-built actors and the most popular ones are battle-tested against site changes far more often than any in-house scraper. The economic model is “rent the scraper somebody already debugged.” You supply input (a search term, a profile URL, a category), the actor returns structured JSON, you pay per compute unit consumed. The MCP integration on the apify-actor server above lets an AI agent invoke any actor in your account by name and read the results back into context.
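The MCP recipe below shows the agent-driven path. If you would rather call an actor over plain HTTP, the whole transaction is a single POST. A minimal sketch in Python, assuming the run-sync-get-dataset-items endpoint documented in the Apify API v2 docs; the input keys match the google-search-scraper actor used below, and the output field names are our assumption — inspect the first returned item for your actor's actual schema:

```python
import os
import requests

# Run an actor synchronously and read its dataset items in one call.
# Endpoint shape per the Apify API v2 docs; confirm before relying on it.
ACTOR = "apify~google-search-scraper"  # "/" in actor names becomes "~" in the URL path
url = f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items"

resp = requests.post(
    url,
    params={"token": os.environ["APIFY_TOKEN"]},
    json={
        "queries": "best espresso machine 2026",
        "countryCode": "us",
        "maxPagesPerQuery": 1,
        "resultsPerPage": 10,
    },
    timeout=300,  # synchronous runs block until the actor finishes
)
resp.raise_for_status()
for item in resp.json():
    # Field names vary per actor; these are assumptions for this actor.
    print(item.get("searchQuery"), len(item.get("organicResults", [])))
```

The same URL shape works for any actor in your account; only the JSON input changes, which is the whole point of the marketplace model.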
Pick this if you...
- Need to scrape a well-known site and prefer rent-over-build for site-specific scrapers
- Want managed infrastructure with scheduling, alerts, versioning, and a dashboard out of the box
- Plan to graduate from local Crawlee scripts to scheduled cloud jobs without re-writing the scraper
- Want an AI agent to invoke a known actor by name from a Cursor or Claude Code chat without writing glue code
Recipe: scrape Google SERPs for a list of keywords via MCP
With Apify Actor MCP installed and your APIFY_TOKEN set, in any MCP client paste this prompt:
```
Use the Apify Actor MCP. Run the apify/google-search-scraper
actor on this input:
{
  "queries": "best espresso machine 2026\nbest grinder 2026",
  "countryCode": "us",
  "maxPagesPerQuery": 1,
  "resultsPerPage": 10
}
Wait for the run to complete (poll if needed), then return the
top 5 organic results for each query — title, URL, and snippet
— as a JSON array.
```
The agent calls call-actor with the input above, polls the run status, then fetches the dataset items when the run finishes. The output drops into the conversation as JSON you can pipe into whatever lives downstream. The same pattern works for any of the ~29k actors in the store; the shape of the call is fixed and only the input schema changes per actor.
Skip it if...
You’re scraping a site nobody has built an actor for, or your volume is high enough that paying $0.20 per compute unit becomes a meaningful line item. The first case sends you to Crawlee or Firecrawl; the second case sends you to Crawlee on your own infrastructure or to Spider for throughput. Apify Free gives you $5 of credits to test the store before committing.
Spider — install + recipe
Spider
Rust-cored crawling API · spider.cloud
Open-source core
github.com/spider-rs/spider
Pricing
$1/GB bandwidth + $0.001/min compute · 2,500 free sign-up credits
Output formats
markdown, HTML, JSON, CSV, XML
Rate limit
10K RPM
What it does best
Spider is the answer for crawling at the scale where compute cost becomes a budget conversation. The crawler is written in Rust with async concurrency from the ground up — there are no garbage-collection pauses, no Node event-loop contention, and the per-page CPU cost is materially lower than what JS-based crawlers spend. The hosted API quotes peaks of 100k+ pages per second under best-case conditions and applies pay-per-GB bandwidth plus per-minute compute billing, which lines up cleanly with how very large crawlers actually consume resources. Built-in anti-bot rotation across proxies in 199 countries means you usually don’t bring your own proxy stack at this scale.
Pick this if you...
- Crawl hundreds of millions of pages a month and compute is in your top three line items
- Want to pay only for successful requests with no subscription floor
- Need built-in proxy rotation across many countries without managing the proxy stack yourself
- Are open to a closed-source hosted API but want the option to self-host the open-core spider-rs crawler later
Recipe: crawl a docs site and return markdown
Spider doesn’t ship a first-party MCP server, so use it via curl or its SDK. Here’s the minimal HTTP shape:
```bash
curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 100,
    "return_format": "markdown",
    "depth": 2
  }'
```
The response is a JSON array of pages with markdown bodies, links, and metadata. Wire this into a Crawlee job for the long-tail or call it directly from an agent via a thin custom MCP wrapper — community examples are floating around but no production-grade Spider MCP exists yet on this directory. Track the web-scraping category for new entries.
Skip it if...
You don’t need extreme throughput. For a few thousand pages a day, Firecrawl is simpler and the markdown output quality is generally cleaner. For specific well-known sites, an Apify actor is already debugged. Spider is the right pick when you’re at the scale where the Rust core matters; otherwise it’s overkill.
Firecrawl — install + recipe
What it does best
Firecrawl is the default LLM-facing scraper in 2026 for one clear reason: the markdown it returns is consistently cleaner than what raw HTML-to-Markdown conversion produces. Boilerplate gets stripped, navigation gets dropped, the body gets isolated — and the agent reads what humans would call “the content” rather than a wall of nav links and footer cruft. The five endpoints (/scrape for one URL, /crawl for a site, /map for URL discovery, /search for web search, /extract for structured data) cover the vast majority of read jobs an LLM agent needs. The MCP server at firecrawl/firecrawl-mcp-server wraps all five behind a single stdio install, and the Apache 2.0 core at github.com/firecrawl/firecrawl crosses 100k stars — both hosted and self-host paths are real.
Pick this if you...
- Build an LLM agent that reads public web pages and want the cleanest markdown output without writing converters
- Need /scrape, /crawl, /map, /search, and /extract all behind one auth boundary and one MCP tool surface
- Want the option to self-host with Apache 2.0 once your volume justifies running your own proxy stack
- Are okay starting on the free 1,000-credit tier and graduating to Hobby ($16/mo) or Standard ($83/mo) as volume grows
Recipe: scrape and extract pricing from a competitor page
With Firecrawl MCP installed and your FIRECRAWL_API_KEY set, paste this in Cursor or Claude Code:
```
Use the Firecrawl MCP. /scrape https://example-saas.com/pricing
in markdown mode, then /extract with this schema:
{
  "plans": [{
    "name": "string",
    "monthly_price_usd": "number | null",
    "key_features": "string[]"
  }]
}
Return the extracted JSON only — no commentary. If the page
requires JavaScript rendering, set actions: [{ type: "wait",
milliseconds: 1500 }] before extraction.
```
Two tool calls: scrape gives the agent markdown plus the cleaned DOM; extract runs the schema-guided LLM extraction and returns typed JSON. The same shape works against documentation sites, news articles, product catalogs, and almost any public page. The credit cost is 1 credit per page for /scrape plus /extract’s additional LLM cost.
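If you are not in an MCP client, the first step works over plain HTTP too. A minimal sketch, assuming the v1 /scrape request and response shape from the Firecrawl docs; field names can drift between API versions, so confirm against firecrawl.dev before building on it:

```python
import os
import requests

API = "https://api.firecrawl.dev/v1"
headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

# Scrape one page as markdown; /extract follows the same POST pattern
# with a "schema" payload if you want the structured step server-side.
resp = requests.post(
    f"{API}/scrape",
    headers=headers,
    json={"url": "https://example-saas.com/pricing", "formats": ["markdown"]},
    timeout=60,
)
resp.raise_for_status()
# Assumed response envelope: {"success": true, "data": {"markdown": "..."}}
markdown = resp.json()["data"]["markdown"]
print(markdown[:500])
```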
Skip it if...
Your target site sits behind a login, or your volume is high enough that the Standard tier’s 100k monthly credits ($83/mo) don’t cover it. For interactive flows behind auth, use Playwright MCP instead. For very high volume, self-host the Apache 2.0 core or move to Crawlee.
Crawlee — install + recipe
Crawlee
OSS scraping framework by Apify · TypeScript + Python
Install
npm install crawlee playwright
Crawlers
Cheerio, JSDOM, Playwright, Puppeteer, HTTP
Cost
Free (you pay infra + proxies)
What it does best
Crawlee is the canonical “build your own scraper” framework. It gives you persistent URL queues, session pools, proxy rotation, browser-like header generation, HTTP/2 support, and unified APIs for Cheerio, JSDOM, Playwright, and Puppeteer — so you pick the right tool per site without rewriting the framework around it. The same team builds Apify, so a Crawlee scraper deploys cleanly to Apify Cloud if you want to graduate from local execution to scheduled hosted jobs without throwing away the code. Note for clarity: Crawlee and Apify are different products from the same team — Crawlee is the library you install and run; Apify is the SaaS that hosts actors.
Pick this if you...
- Want full control over the scraping pipeline and prefer paying for compute over paying per-page API fees
- Crawl sites where no Apify actor exists and you don’t want to write the request-queue plumbing from scratch
- Need to mix headless-browser and plain-HTTP crawling in one project with shared session and proxy state
- Want optionality to deploy to Apify Cloud later without changing your scraper code
Recipe: a minimal PlaywrightCrawler
After npm install crawlee playwright:
```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 5,
  maxRequestsPerCrawl: 100,
  async requestHandler({ page, request, enqueueLinks, pushData }) {
    const title = await page.title();
    const heading = await page.locator('h1').first().textContent();
    await pushData({ url: request.url, title, heading });
    await enqueueLinks({ strategy: 'same-domain' });
  },
});

await crawler.run(['https://example.com']);
```
That’s the entire scraper: queue, browser pool, link discovery, results storage. Run it with node index.js; results land in storage/datasets/default as JSON. Swap PlaywrightCrawler for CheerioCrawler for static HTML — same API, ten times the throughput. The Python equivalent in apify/crawlee-python uses identical concepts with Pythonic syntax.
Skip it if...
You want zero infrastructure overhead, or you only need to scrape a handful of pages a month. Firecrawl’s 1,000 free credits cover the latter without you running anything; Apify Store covers the former by renting somebody else’s actor. Crawlee shines once you’re operating real volume and the SaaS bill starts to compete with your engineering time.
ScrapeGraph — install + recipe
ScrapeGraph
AI-powered scraping library · Python
Install
pip install scrapegraphai
LLM support
OpenAI, Anthropic, Groq, Gemini, Azure, Ollama (local)
Cost
Free library · you pay LLM tokens + infra
What it does best
ScrapeGraph throws out the “write a selector per field” model entirely. You describe what you want in a sentence, you supply an LLM key, the library fetches the page via Playwright, runs the content through the model, and returns structured data. The pitch lives or dies on developer experience — and on the 24.9k-star ScrapeGraphAI/Scrapegraph-ai repo it delivers: prototypes that took two days of selector-writing now take twenty minutes. Model agnosticism is the second pillar: OpenAI, Anthropic, Groq, Gemini, Azure, and Ollama (for fully offline extraction). The Ollama path matters more than vendors typically admit; teams scraping sensitive data use it to keep page content out of any third-party API.
Pick this if you...
- Want to skip writing CSS selectors entirely and describe the schema in natural language
- Prototype scraping pipelines where the field set changes every week
- Scrape sensitive pages and need a fully-local model via Ollama (page content never leaves your network)
- Are okay paying for LLM tokens per page in exchange for near-zero maintenance when sites redesign
Recipe: minimal SmartScraperGraph
After pip install scrapegraphai and playwright install:
```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "openai/gpt-4o-mini",
        "api_key": "YOUR_OPENAI_KEY",
    },
    "verbose": False,
    "headless": True,
}

smart_scraper = SmartScraperGraph(
    prompt="Extract every project: title, description, GitHub stars, language.",
    source="https://github.com/trending",
    config=graph_config,
)

result = smart_scraper.run()
print(result)
```
Three lines of meaningful code: the prompt, the source, the model. The library handles browser navigation, HTML preprocessing, model invocation, and JSON normalisation. Swap openai/gpt-4o-mini for ollama/llama3.1 with a local Ollama instance for fully-offline extraction.
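For that fully-local path, only the llm block changes. A sketch under assumptions: a local Ollama daemon on its default port, and config keys as we understand them; the exact keys have shifted across ScrapeGraph versions, so check the library docs for yours:

```python
# Same SmartScraperGraph call as above; only the config changes.
graph_config_local = {
    "llm": {
        "model": "ollama/llama3.1",
        # Assumed key: default Ollama endpoint; adjust if yours runs elsewhere.
        "base_url": "http://localhost:11434",
    },
    "verbose": False,
    "headless": True,
}
```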
Skip it if...
You’re scraping the same template across millions of pages — running an LLM per page is slower and more expensive than a CSS selector you write once. The next section unpacks where AI extraction wins and where it loses.
AI extraction vs CSS selectors
The single biggest 2024-to-2026 shift in scraping is the rise of LLM-based extraction. Both Firecrawl’s /extract endpoint and ScrapeGraph’s entire library bet that describing the output in English beats writing CSS selectors by hand. The bet is right for some jobs and wrong for others, and the breakdown matters more than the marketing on either side suggests.
AI extraction wins on three jobs. First, prototypes — you don’t know what fields you want yet, and rewriting selectors after each pivot wastes time. Second, sites that redesign frequently: a model tolerates layout changes that would break a brittle selector chain, because it reads semantic structure rather than DOM paths. Third, low-volume one-offs where the marginal cost of an LLM call ($0.001 to $0.01 per page) is cheaper than thirty minutes of selector authoring.
CSS selectors win on three jobs. First, cost at scale — running an LLM on every page is materially more expensive than a parsed-once selector when you’re scraping ten million pages per month from the same template. Second, determinism: a model can hallucinate a field that doesn’t exist or normalise inconsistently across runs, while a missing selector returns null in a known way. Third, latency: an LLM call adds 500-3000ms per page; a CSS selector adds zero.
Firecrawl’s /extract endpoint occupies a useful middle. The schema-guided extraction is LLM-backed but happens server-side; you pay Firecrawl credits rather than running the LLM yourself; the prompt engineering is hidden behind a JSON schema you supply. It’s the right shape when you want the description-vs-selectors experience but don’t want to operate a model. ScrapeGraph is the right shape when you want to control the model entirely — pick your provider, control prompts directly, run locally if your data sensitivity requires it.
A common production pattern: use AI extraction for the long tail (sites you scrape rarely), use CSS selectors for the high-volume targets (your top ten sources), and fall back to AI extraction whenever selectors break. Many teams ship this hybrid in production and treat “selector breakage rate” as a leading indicator that triggers re-authoring or fallback to AI extraction permanently.
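The fallback is a few lines to wire. A minimal sketch of the hybrid, assuming BeautifulSoup for the selector pass and a hypothetical llm_extract() callable standing in for whichever AI extractor you use (ScrapeGraph, Firecrawl /extract, or your own prompt); the selectors and field names are illustrative:

```python
from bs4 import BeautifulSoup

REQUIRED_FIELDS = ("title", "price")

def selector_extract(html: str) -> dict:
    """Fast, deterministic pass for the high-volume template."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1.product-title")  # assumed selector
    price = soup.select_one("span.price")        # assumed selector
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }

def extract(html: str, llm_extract) -> dict:
    """Selectors first; fall back to AI extraction on breakage."""
    record = selector_extract(html)
    if all(record.get(f) for f in REQUIRED_FIELDS):
        return record
    # Selector breakage: the layout changed, or this is a long-tail page.
    # llm_extract is your AI path; log the miss to track breakage rate.
    return llm_extract(html, fields=REQUIRED_FIELDS)
```

Counting how often the fallback fires gives you the "selector breakage rate" metric for free.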
The other piece: model choice matters more than the library choice for AI extraction. GPT-4o-mini and Claude 3.5 Haiku are the cheap-and-fast workhorses; Llama 3.1 via Ollama wins for sensitive-data jobs; GPT-4o and Claude 4 Opus appear when extraction quality on weird layouts matters more than per-page cost. The scraping library doesn’t care which you pick — but your bill does.
Cost shape
The five tools price for different workloads, which is worth comparing apples-to-apples-ish. All numbers are from the vendors’ public pricing pages as of May 2026; confirm at the source before designing around them.
| Tier | Apify | Spider | Firecrawl | Crawlee | ScrapeGraph |
|---|---|---|---|---|---|
| Free | $5 credits/mo | 2,500 sign-up credits | 1,000 credits/mo | Free library | Free library |
| Entry paid | Starter $29/mo + usage | PAYG ($1/GB + $0.001/min) | Hobby $16/mo (5k credits) | — | — |
| Mid | Scale $199/mo + usage | PAYG with volume credits | Standard $83/mo (100k credits) | — | — |
| High | Business $999/mo + usage | Volume pricing (contact) | Growth $333/mo (500k credits) | — | — |
| Top | Enterprise (custom) | Enterprise (custom) | Scale $599/mo (1M credits) | — | — |
| Compute model | Compute units ($0.13-$0.20 each) | Bandwidth + compute time | Credits per page | Self-host infra | Self-host infra + LLM tokens |
| Best for | Marketplace + mid volume | Extreme throughput | LLM-facing scraping | Full pipeline control | AI extraction |
Three observations. Apify is the only one that requires you to think in compute units — not pages — which lines up with the marketplace model where actor complexity varies wildly. Spider’s pay-per-GB plus per-minute compute is the most accurate billing shape for very high volume, because that’s where your real costs live; small workloads find it annoying. Firecrawl’s credit tiers are the most predictable for anyone budgeting from a credits-per-page mental model. Crawlee and ScrapeGraph cost zero in licence fees but require you to operate infrastructure (and an LLM provider for ScrapeGraph) — which can be very cheap or quite expensive depending on volume and how you provision proxies.
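To make the shapes concrete, here is a back-of-envelope model for one million pages a month using only the public numbers above. The page weight and pages-per-compute-unit figures are our assumptions, not vendor numbers; swap in measurements from your own two-day test before deciding anything:

```python
PAGES = 1_000_000  # monthly volume to model

# Firecrawl: 1 credit per /scrape page; Scale tier is 1M credits for $599/mo.
firecrawl = 599.0

# Spider: $1/GB bandwidth + $0.001/min compute (assumed page weight and runtime).
avg_page_mb = 0.5        # assumption: average transfer per page
crawl_minutes = 2_000    # assumption: total compute minutes for the month
spider = (PAGES * avg_page_mb / 1024) * 1.0 + crawl_minutes * 0.001

# Apify: $0.13-$0.20 per compute unit; pages per CU varies wildly by actor.
pages_per_cu = 500       # assumption: a cheap HTTP-only actor
apify = PAGES / pages_per_cu * 0.20

print(f"Firecrawl ~${firecrawl:,.0f}  Spider ~${spider:,.0f}  Apify ~${apify:,.0f}")
# Under these assumptions: Firecrawl ~$599, Spider ~$490, Apify ~$400.
```

The point is not the specific winner; it is that each model's output swings by an order of magnitude when one assumption (page weight, actor complexity) moves.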
Confirm at the source: apify.com/pricing, spider.cloud/pricing, firecrawl.dev/pricing. Crawlee and ScrapeGraph have no vendor pricing — both are libraries.
Common pitfalls
Anti-bot defences are a moving target
Cloudflare, DataDome, PerimeterX, and friends evolve monthly. A scraper that worked last week breaks this week. Firecrawl and Spider rotate proxies and fingerprints on the hosted side; Apify Store actors encode site-specific mitigations per actor; Crawlee and ScrapeGraph leave anti-bot to you. Budget time for maintenance, not just initial implementation, if you self-host. The honest answer: there is no permanent solution; there’s only the choice of who maintains the cat-and-mouse.
JavaScript rendering cost
Headless browsers are 5-20x more expensive than HTTP requests — both in time and in compute. Firecrawl auto-detects when a page needs JS; Spider lets you toggle; Crawlee makes you choose between PlaywrightCrawler and CheerioCrawler explicitly. Default to HTTP wherever the site renders content server-side; reach for a browser only when the page requires it. The wrong default doubles your bill and halves your throughput.
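One way to enforce the right default in your own pipeline: try plain HTTP first and escalate to a browser only when the content genuinely is not in the server-rendered HTML. A sketch assuming requests and Playwright are installed, and that you know a marker string that should appear in real content; a stricter check would parse and test a selector:

```python
import requests
from playwright.sync_api import sync_playwright

def fetch_html(url: str, must_contain: str) -> str:
    """HTTP first; escalate to a headless browser only on a miss."""
    html = requests.get(url, timeout=30).text
    if must_contain in html:  # crude presence check on the cheap path
        return html
    # Content is client-rendered: pay the 5-20x browser cost only here.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```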
robots.txt is not legal protection (or permission)
Ignoring robots.txt isn’t a crime; respecting it isn’t a green light. It’s a request that courts may weigh in disputes. Site Terms of Service, copyright, the CFAA (US), CMA (UK), and GDPR (EU) are the actually-binding constraints, and they vary by jurisdiction. hiQ v. LinkedIn set important precedent on public data scraping in the US, but the legal landscape is still evolving. Talk to legal before scraping a site you don’t own at meaningful scale.
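Respecting robots.txt remains the polite default even though it is not a legal shield, and the mechanical part is in Python's standard library. The judgment calls (ToS, jurisdiction, scale) stay with you and counsel:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check a specific path for your crawler's user agent before fetching.
if rp.can_fetch("my-crawler/1.0", "https://example.com/products/42"):
    print("allowed by robots.txt (a courtesy signal, not permission)")
else:
    print("disallowed: skip it")
```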
LLM hallucination in extraction
AI extraction can invent fields that don’t exist on the page, normalise prices inconsistently across runs, and confidently extract from sections that look like the right section but aren’t. Validate every LLM-extracted field against a deterministic check — schema validation, cross-page consistency, comparison against a small selector-based sample. If you can’t validate, you can’t trust. ScrapeGraph and Firecrawl /extract both add this risk; CSS selectors don’t.
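The cheapest deterministic check is schema validation plus a couple of sanity rules. A sketch using pydantic (our choice; jsonschema works just as well) over the pricing-plan shape from the Firecrawl recipe above; the plausible-price range is an assumption you would tune per field:

```python
from pydantic import BaseModel, field_validator

class Plan(BaseModel):
    name: str
    monthly_price_usd: float | None
    key_features: list[str]

    @field_validator("monthly_price_usd")
    @classmethod
    def price_sane(cls, v):
        # Catch hallucinated or mis-normalised prices (e.g. cents vs dollars).
        if v is not None and not (0 <= v <= 10_000):
            raise ValueError(f"price out of plausible range: {v}")
        return v

def validate_plans(raw: list[dict]) -> list[Plan]:
    # Raises on the first record the model got structurally wrong.
    return [Plan.model_validate(r) for r in raw]
```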
Pricing model traps
Apify charges compute units, which vary by actor — a complex actor on a heavy site can burn through $5 of credits faster than you expect. Spider charges bandwidth which feels small until you crawl image-heavy sites and a single domain eats 50GB. Firecrawl’s credit model is simple until you call /extract a lot, where the additional LLM cost stacks. Model the workload before signing — vendor calculators on each pricing page help, but a real two-day test is more honest.
Vendor lock-in via proxy stack
The biggest sunk cost in scraping isn’t the library — it’s the proxy pool. Once you’ve curated 50,000 residential proxies across 30 countries, switching to a different framework is trivial; switching off the proxies (or onto a different proxy vendor) is the expensive move. Hosted APIs hide this complexity but also hide the cost; building your own surfaces both. Pick consciously.
“The scraper is up” isn’t the same as “the data is right”
Sites silently change schemas, change currencies on price fields, A/B test layouts, switch pagination patterns. A scraper that returns 200s can be returning wrong data and you won’t notice until your downstream pipeline does something visibly weird. Add data-quality monitoring — anomaly detection on price distributions, field-presence checks across runs, a daily sample reviewed by a human. None of the five tools ships this; you build it.
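A field-presence check across runs is a few lines and catches most silent schema drift. A minimal sketch; the alert threshold is an assumption you would tune per field:

```python
def field_presence(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records where each field is present and non-empty."""
    n = max(len(records), 1)
    return {f: sum(1 for r in records if r.get(f)) / n for f in fields}

def drifted_fields(today: dict[str, float], baseline: dict[str, float],
                   max_drop: float = 0.15) -> list[str]:
    """Flag fields whose presence rate fell sharply vs the last good run."""
    return [f for f, rate in today.items()
            if baseline.get(f, 0.0) - rate > max_drop]

# Usage: alert if drifted_fields(field_presence(items, ["title", "price"]),
#                                yesterdays_rates) is non-empty.
```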
Community signal
The scraping space moves fast and the public conversation is correspondingly noisy — vendor marketing claims need to be discounted, but consistent signals across HN, Reddit, GitHub Discussions, and Twitter are useful triangulation. We won’t fabricate quotes; here’s the consensus picture from late 2025 through May 2026.
Firecrawl is the clear default for LLM-facing scraping. The 100k+ star count on the Apache 2.0 repo isn’t noise — it reflects heavy actual usage. Threads on Hacker News and r/LLM consistently surface Firecrawl as the first recommendation when somebody asks “what should my agent use to read web pages,” and the most common gripe is the credit cost at scale rather than output quality. The self-host path is real but requires meaningful operational investment.
Apify has the strongest position in site-specific scraping. The marketplace effect is hard to argue with — when a popular site changes, the actor that scrapes it usually has a fix shipped within hours by an author whose entire business is that one actor. Engineering teams who try to rebuild what an Apify actor does usually conclude rent-don’t-build was the right call after a month of maintenance.
Crawlee remains the framework people reach for when they’ve outgrown a hosted API and want to own the cost curve. The Python port closed the “but we’re a Python shop” objection that previously sent people to Scrapy. The 23k-star repo updates actively; the headless-browser story is uniquely strong compared to alternatives.
Spider is less mainstream but shows up consistently in conversations about large-scale crawls — typically when someone’s AWS bill from running Node-based crawlers reached a threshold that justified looking at a Rust alternative. The 2k-star spider-rs repo is smaller than the others; the hosted product is where most Spider usage actually lives.
ScrapeGraph sits in the fastest-growth bucket of the five. The 24.9k stars on a Python library that’s essentially “Playwright + an LLM” reflect how much demand there was for somebody to package the obvious idea well. The Ollama support is the most-mentioned feature in community threads — sensitive-data teams love that page content never has to leave their network. The honest critique is per-page cost at high volume, which is the inherent ceiling of the LLM-first approach.
Frequently asked questions
What's the fastest way to choose between Apify, Spider, Firecrawl, Crawlee, and ScrapeGraph?
Start with shape, not features. If you want a hosted API that returns clean markdown for an LLM, Firecrawl is the default. If you need a marketplace of pre-built scrapers for specific sites (Amazon listings, Google SERPs, Instagram profiles), Apify wins because Apify Store ships almost 29,000 ready-made actors. If you crawl hundreds of millions of pages a month and care about raw throughput, Spider's Rust core is the differentiator. If you want to build and own the pipeline in TypeScript or Python, Crawlee is the framework. And if you'd rather describe what you want in English than write CSS selectors, ScrapeGraph runs the page through an LLM. Pick one shape; the feature checklists matter only after.
Is Firecrawl really open-source, or is the OSS version crippled?
It's genuinely open-source under Apache 2.0 at github.com/firecrawl/firecrawl — the repo crosses 100k stars and ships the same core engine the hosted SaaS runs on. What you give up self-hosting is the managed scaling, the proxy pool, the SLA, and the AI-extraction add-on that depends on an LLM key. For small-to-medium teams that means cloning, supplying their own proxies, and running it on a couple of workers; for larger teams it usually means buying credits instead because the proxy-and-rotation operational cost exceeds the SaaS bill. Both are real options, which is why we cover both in the Firecrawl section above.
Apify vs Crawlee — they're from the same company, what's the difference?
Apify is the platform — a hosted SaaS where you run scrapers (called 'actors') on Apify's infrastructure, browse the marketplace, schedule jobs, and pay per compute unit. Crawlee is the open-source Node.js / Python library under Apache 2.0 that you install with npm or pip and run yourself. The same team builds both, so a Crawlee scraper deploys cleanly to Apify if you want to graduate from local execution to hosted scheduling. Pick Apify when the marketplace already covers your target site or you don't want to operate infrastructure. Pick Crawlee when you want to keep everything on your own workers and own the cost curve. The decision usually comes down to whether you're shipping a product or running an internal pipeline.
What's actually different about Spider's Rust core?
Spider is written in Rust with async concurrency from the ground up, and the open-core repo at github.com/spider-rs/spider has 2k+ stars. The practical impact is twofold: the per-page CPU cost is lower than Node-based crawlers, and the concurrency primitives let it sustain very high request rates without garbage-collection pauses. Spider's own marketing claims peaks of 100k+ pages per second on the hosted side, which you should treat as a best-case lab number — real-world throughput depends on target-site latency, proxy stack, and rate-limit policies. The honest summary: if you're crawling a small number of sites, Spider's speed is invisible. If you're crawling at the scale where compute is your biggest line item, the Rust core matters.
When does ScrapeGraph beat writing CSS selectors by hand?
ScrapeGraph wins on three jobs: prototypes where you're not sure what fields you want yet, scrapes against sites that redesign their DOM frequently, and one-off extractions where the marginal cost of writing selectors exceeds the cost of an LLM call. The losses are also predictable. CSS selectors win on cost when you're scraping the same template across millions of pages — running an LLM per page is slower and more expensive than a parsed-once selector. They also win on determinism: a model can hallucinate a field that doesn't exist, while a missing selector returns nothing. The 24.9k-star github.com/ScrapeGraphAI/Scrapegraph-ai repo is excellent for the prototype-and-pivot phase; production high-volume jobs usually graduate back to selectors plus a fallback LLM pass for the long tail.
Do any of these have MCP servers an AI agent can use?
Two of the five ship production MCP integrations today. Firecrawl has the most mature one — github.com/firecrawl/firecrawl-mcp-server wraps /scrape, /crawl, /map, /extract, and /search behind a single stdio process, and the canonical install configs live on the /servers/firecrawl page on this directory. Apify ships its own MCP for invoking actors against a hosted account; see /servers/apify-actor for the install card with current configs. Spider, Crawlee, and ScrapeGraph don't have first-party MCP servers as of May 2026, though community wrappers exist for each. For agent-facing work in Cursor, Claude Code, VS Code, or Windsurf, Firecrawl plus Apify covers the read-and-extract surface most teams need.
Which one handles JavaScript-rendered pages best?
All five can render JavaScript, but they pay for it differently. Firecrawl auto-detects when a page needs a browser and switches engines under the hood — you don't think about it, you do pay one credit per page either way. Spider exposes browser rendering as an option you toggle. Apify Store ships actor variants for browser-heavy targets (most well-known Apify actors for Twitter, Instagram, and Amazon already use headless browsers). Crawlee gives you the choice explicitly: a CheerioCrawler for static HTML, a PlaywrightCrawler for JS-rendered pages, or a PuppeteerCrawler if you prefer that flavour. ScrapeGraph uses Playwright under the hood. For the developer experience, Firecrawl is the lightest touch; for explicit control, Crawlee plus Playwright is the canonical answer.
Can I scrape any site? What about robots.txt and ToS?
Technically you can scrape almost anything, legally it's nuanced, and ethically the answer is 'check first.' robots.txt is an honour-system signal — it doesn't enforce anything, but disrespecting it is a flag in any legal dispute and a moral problem most engineering teams don't want to argue about. Site Terms of Service usually prohibit scraping; courts in the US have repeatedly held public data scraping isn't a CFAA violation (see hiQ v. LinkedIn), but contract-based claims, copyright claims, and trespass-to-chattels claims are real. None of the five tools enforces this for you — Firecrawl and Spider let you opt in to robots.txt respect; Apify Store actor authors decide per-actor; Crawlee gives you the primitives but you write the policy; ScrapeGraph passes the page to an LLM regardless. Talk to legal before scraping a site you don't own at meaningful scale. This post is a tool comparison, not legal advice.
What about Playwright, Puppeteer, BeautifulSoup, Selenium?
They're the older toolchain and they still work. Playwright is the modern browser-automation framework; we cover it head-to-head with Firecrawl, Anycrawl, and Crawlee in the sibling post linked at the top of this page. Puppeteer (Chrome-only, Node.js, by Google) was Playwright's predecessor and remains widely deployed — Crawlee supports both interchangeably. Selenium is the cross-language test-automation grandparent; useful when your team already runs Selenium for QA, less compelling as a green-field scraping pick in 2026. BeautifulSoup is a Python HTML parser, not a crawler — pair it with httpx or aiohttp for a lightweight scraper, or use Crawlee's CheerioCrawler for the same shape in Node. None of these have first-party MCP servers; Microsoft's @playwright/mcp wraps Playwright for agent use, which is covered in our chrome-devtools-mcp vs playwright-mcp deep-dive.
How do I pick between Firecrawl's /extract and ScrapeGraph?
Both use an LLM to pull structured data from a page, but they sit at different layers. Firecrawl's /extract is a hosted endpoint — you POST a URL and a schema, you get back JSON, and you're billed per page in Firecrawl credits. The LLM is managed; the proxy stack is managed; the only thing you control is the schema. ScrapeGraph is a library you install yourself, you supply the LLM key (OpenAI, Anthropic, Groq, Gemini, or Ollama for local), and you run it on your own infrastructure. If you want zero infrastructure and an LLM bill rolled into your scraper bill, /extract is simpler. If you want to use your own model, control prompts, or run fully offline with a local LLM, ScrapeGraph is the right shape. Hybrid stacks are common: /extract for the hosted majority, ScrapeGraph for sensitive pages where data can't leave your network.
Sources
Apify
- apify.com/store — actor marketplace (almost 29,000 actors)
- apify.com/pricing — free, Starter, Scale, Business, Enterprise
- docs.apify.com/api/v2 — actor invocation API
Spider
- spider.cloud — hosted Rust crawling API
- spider.cloud/pricing — pay-per-GB pricing
- github.com/spider-rs/spider — open-source Rust core
Firecrawl
- firecrawl.dev — hosted SaaS
- github.com/firecrawl/firecrawl — Apache 2.0 open-source core (100k+ stars)
- github.com/firecrawl/firecrawl-mcp-server — official MCP server
- firecrawl.dev/pricing — credit tiers
Crawlee
- crawlee.dev — documentation
- github.com/apify/crawlee — Apache 2.0, TypeScript (23.2k stars)
- github.com/apify/crawlee-python — Python port
ScrapeGraph
- scrapegraphai.com — project site
- github.com/ScrapeGraphAI/Scrapegraph-ai — MIT, Python (24.9k stars)
Related comparisons
- Firecrawl vs Anycrawl vs Crawlee vs Playwright (2026) — API-vs-library cut with Playwright
- Best web search MCP servers (2026)
- Chrome DevTools MCP vs Playwright MCP (2026)
Internal links
- /servers/apify-actor
- /servers/firecrawl
- /blog/what-is-mcp — protocol primer
- /best-mcp-servers — curated roundup
- /servers — browse all 3,000+