crawl4ai
This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
Install
mkdir -p .claude/skills/crawl4ai && curl -L -o skill.zip "https://mcp.directory/api/skills/download/265" && unzip -o skill.zip -d .claude/skills/crawl4ai && rm skill.zipInstalls to .claude/skills/crawl4ai
About this skill
Crawl4AI
Overview
This skill provides comprehensive support for web crawling and data extraction using the Crawl4AI library, including the complete SDK reference, ready-to-use scripts for common patterns, and optimized workflows for efficient data extraction.
Quick Start
Installation Check
# Verify installation
crawl4ai-doctor
# If issues, run setup
crawl4ai-setup
Basic First Crawl
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:500]) # First 500 chars
asyncio.run(main())
Using Provided Scripts
# Simple markdown extraction
python scripts/basic_crawler.py https://example.com
# Batch processing
python scripts/batch_crawler.py urls.txt
# Data extraction
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
Core Crawling Fundamentals
1. Basic Crawling
Understanding the core components for any crawl:
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
# Browser configuration (controls browser behavior)
browser_config = BrowserConfig(
headless=True, # Run without GUI
viewport_width=1920,
viewport_height=1080,
user_agent="custom-agent" # Optional custom user agent
)
# Crawler configuration (controls crawl behavior)
crawler_config = CrawlerRunConfig(
page_timeout=30000, # 30 seconds timeout
screenshot=True, # Take screenshot
remove_overlay_elements=True # Remove popups/overlays
)
# Execute crawl with arun()
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
config=crawler_config
)
# CrawlResult contains everything
print(f"Success: {result.success}")
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Links found: {len(result.links)}")
2. Configuration Deep Dive
BrowserConfig - Controls the browser instance:
headless: Run with/without GUIviewport_width/height: Browser dimensionsuser_agent: Custom user agent stringcookies: Pre-set cookiesheaders: Custom HTTP headers
CrawlerRunConfig - Controls each crawl:
page_timeout: Maximum page load/JS execution time (ms)wait_for: CSS selector or JS condition to wait for (optional)cache_mode: Control caching behaviorjs_code: Execute custom JavaScriptscreenshot: Capture page screenshotsession_id: Persist session across crawls
3. Content Processing
Basic content operations available in every crawl:
result = await crawler.arun(url)
# Access extracted content
markdown = result.markdown # Clean markdown
html = result.html # Raw HTML
text = result.cleaned_html # Cleaned HTML
# Media and links
images = result.media["images"]
videos = result.media["videos"]
internal_links = result.links["internal"]
external_links = result.links["external"]
# Metadata
title = result.metadata["title"]
description = result.metadata["description"]
Markdown Generation (Primary Use Case)
1. Basic Markdown Extraction
Crawl4AI excels at generating clean, well-formatted markdown:
# Simple markdown extraction
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://docs.example.com")
# High-quality markdown ready for LLMs
with open("documentation.md", "w") as f:
f.write(result.markdown)
2. Fit Markdown (Content Filtering)
Use content filters to get only relevant content:
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Option 1: Pruning filter (removes low-quality content)
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed")
# Option 2: BM25 filter (relevance-based filtering)
bm25_filter = BM25ContentFilter(user_query="machine learning tutorials", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
result = await crawler.arun(url, config=config)
# Access filtered content
print(result.markdown.fit_markdown) # Filtered markdown
print(result.markdown.raw_markdown) # Original markdown
3. Markdown Customization
Control markdown generation with options:
config = CrawlerRunConfig(
# Exclude elements from markdown
excluded_tags=["nav", "footer", "aside"],
# Focus on specific CSS selector
css_selector=".main-content",
# Clean up formatting
remove_forms=True,
remove_overlay_elements=True,
# Control link handling
exclude_external_links=True,
exclude_internal_links=False
)
# Custom markdown generation
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
generator = DefaultMarkdownGenerator(
options={
"ignore_links": False,
"ignore_images": False,
"image_alt_text": True
}
)
Data Extraction
1. Schema-Based Extraction (Most Efficient)
For repetitive patterns, generate schema once and reuse:
# Step 1: Generate schema with LLM (one-time)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Step 2: Use schema for fast extraction (no LLM)
python scripts/extraction_pipeline.py --use-schema https://shop.com generated_schema.json
2. Manual CSS/JSON Extraction
When you know the structure:
schema = {
"name": "articles",
"baseSelector": "article.post",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "date", "selector": ".date", "type": "text"},
{"name": "content", "selector": ".content", "type": "text"}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
3. LLM-Based Extraction
For complex or irregular content:
extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
instruction="Extract key financial metrics and quarterly trends"
)
Advanced Patterns
1. Deep Crawling
Discover and crawl links from a page:
# Basic link discovery
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url)
# Extract and process discovered links
internal_links = result.links.get("internal", [])
external_links = result.links.get("external", [])
# Crawl discovered internal links
for link in internal_links:
if "/blog/" in link and "/tag/" not in link: # Filter links
sub_result = await crawler.arun(link)
# Process sub-page
# For advanced deep crawling, consider using URL seeding patterns
# or custom crawl strategies (see complete-sdk-reference.md)
2. Batch & Multi-URL Processing
Efficiently crawl multiple URLs:
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
async with AsyncWebCrawler() as crawler:
# Concurrent crawling with arun_many()
results = await crawler.arun_many(
urls=urls,
config=crawler_config,
max_concurrent=5 # Control concurrency
)
for result in results:
if result.success:
print(f"✅ {result.url}: {len(result.markdown)} chars")
3. Session & Authentication
Handle login-required content:
# First crawl - establish session and login
login_config = CrawlerRunConfig(
session_id="user_session",
js_code="""
document.querySelector('#username').value = 'myuser';
document.querySelector('#password').value = 'mypass';
document.querySelector('#submit').click();
""",
wait_for="css:.dashboard" # Wait for post-login element
)
await crawler.arun("https://site.com/login", config=login_config)
# Subsequent crawls - reuse session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected-content", config=config)
4. Dynamic Content Handling
For JavaScript-heavy sites:
config = CrawlerRunConfig(
# Wait for dynamic content
wait_for="css:.ajax-content",
# Execute JavaScript
js_code="""
// Scroll to load content
window.scrollTo(0, document.body.scrollHeight);
// Click load more button
document.querySelector('.load-more')?.click();
""",
# Note: For virtual scrolling (Twitter/Instagram-style),
# use virtual_scroll_config parameter (see docs)
# Extended timeout for slow loading
page_timeout=60000
)
5. Anti-Detection & Proxies
Avoid bot detection:
# Proxy configuration
browser_config = BrowserConfig(
headless=True,
proxy_config={
"server": "http://proxy.server:8080",
"username": "user",
"password": "pass"
}
)
# For stealth/undetected browsing, consider:
# - Rotating user agents via user_agent parameter
# - Using different viewport sizes
# - Adding delays between requests
# Rate limiting
import asyncio
for url in urls:
result = await crawler.arun(url)
await asyncio.sleep(2) # Delay between requests
Common Use Cases
Documentation to Markdown
# Convert entire documentation site to clean markdown
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://docs.example.com")
# Save as markdown for LLM consumption
with open("docs.md", "w") as f:
f.write(result.markdown)
E-commerce Product Monitoring
# Generate schema once for product pages
# Then monitor prices/availability without LLM costs
schema = load_json("product_schema.json")
products = await crawler.arun_many(product_urls,
config=CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema)))
News Aggregation
# Crawl multiple news sources concurrently
news_urls = ["https://news1.com", "https://news2.com", "https://news3.com"]
results = await crawler.arun_many(news_urls, max_concurrent=5)
# Extract articles with Fit Markdown
for result in results:
if result.success:
# Get only relevant content
article = result.fit_markdown
Research & Data Collection
# Academic paper collection with focused extraction
config = CrawlerRunConfig(
fit_markdown=True,
fit_markdown_options={
"query": "machine learning transformers",
"max_tokens": 10000
}
)
Resources
scripts/
- extraction_pipeline.py - Three extraction approaches with schema generation
- basic_crawler.py - Simple markdown extraction with screenshots
- batch_crawler.py - Multi-URL concurrent processing
references/
- complete-sdk-reference.md - Complete SDK documentation (23K words) with all parameters, methods, and advanced features
Example Code Repository
The Crawl4AI repository includes extensive examples in docs/examples/:
Core Examples
- quickstart.py - Comprehensive starter with all basic patterns:
- Simple crawling, JavaScript execution, CSS selectors
- Content filtering, link analysis, media handling
- LLM extraction, CSS extraction, dynamic content
- Browser comparison, SSL certificates
Specialized Examples
- amazon_product_extraction_*.py - Three approaches for e-commerce scraping
- extraction_strategies_examples.py - All extraction strategies demonstrated
- deepcrawl_example.py - Advanced deep crawling patterns
- crypto_analysis_example.py - Complex data extraction with analysis
- parallel_execution_example.py - High-performance concurrent crawling
- session_management_example.py - Authentication and session handling
- markdown_generation_example.py - Advanced markdown customization
- hooks_example.py - Custom hooks for crawl lifecycle events
- proxy_rotation_example.py - Proxy management and rotation
- router_example.py - Request routing and URL patterns
Advanced Patterns
- adaptive_crawling/ - Intelligent crawling strategies
- c4a_script/ - C4A script examples
- docker_*.py - Docker deployment patterns
To explore examples:
# The examples are located in your Crawl4AI installation:
# Look in: docs/examples/ directory
# Start with quickstart.py for comprehensive patterns
# It includes: simple crawl, JS execution, CSS selectors,
# content filtering, LLM extraction, dynamic pages, and more
# For specific use cases:
# - E-commerce: amazon_product_extraction_*.py
# - High performance: parallel_execution_example.py
# - Authentication: session_management_example.py
# - Deep crawling: deepcrawl_example.py
# Run any example directly:
# python docs/examples/quickstart.py
Best Practices
- Start with basic crawling - Understand BrowserConfig, CrawlerRunConfig, and arun() before moving to advanced features
- Use markdown generation for documentation and content - Crawl4AI excels at clean markdown extraction
- Try schema generation first for structured data - 10-100x more efficient than LLM extraction
- Enable caching during development -
cache_mode=CacheMode.ENABLEDto avoid repeated requests - Set appropriate timeouts - 30s for normal sites, 60s+ for JavaScript-heavy sites
- Respect rate limits - Use delays and
max_concurrentparameter - Reuse sessions for authenticated content instead of re-logging
Troubleshooting
JavaScript not loading:
config = CrawlerRunConfig(
wait_for="css:.dynamic-content", # Wait for specific element
page_timeout=60000 # Increase timeout
)
Bot detection issues:
browser_config = BrowserConfig(
headless=False, # Sometimes visible browsing helps
viewport_width=1920,
viewport_height=1080,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
# Add delays between requests
await asyncio.sleep(random.uniform(2, 5))
Content extraction problems:
# Debug what's being extracted
result = await crawler.arun(url)
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Links found: {len(result.links)}")
# Try different wait strategies
config = CrawlerRunConfig(
wait_for="js:document.querySelector('.content') !== null"
)
Session/auth issues:
# Verify session is maintained
config = CrawlerRunConfig(session_id="test_session")
result = await crawler.arun(url, config=config)
print(f"Session ID: {result.session_id}")
print(f"Cookies: {result.cookies}")
For more details on any topic, refer to references/complete-sdk-reference.md which contains comprehensive documentation of all features, parameters, and advanced usage patterns.
More by basher83
View all →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
rust-coding-skill
UtakataKyosui
Guides Claude in writing idiomatic, efficient, well-structured Rust code using proper data modeling, traits, impl organization, macros, and build-speed best practices.
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.