# Adaptive Web Crawling

## Introduction
Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. Adaptive Crawling changes this paradigm by introducing intelligence into the crawling process.
Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.
## Key Concepts

### The Problem It Solves
When crawling websites for specific information, you face two challenges:

1. Under-crawling: Stopping too early and missing crucial information
2. Over-crawling: Wasting resources by crawling irrelevant pages
Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.
### How It Works
The AdaptiveCrawler uses three metrics to measure information sufficiency:
- Coverage: How well your collected pages cover the query terms
- Consistency: Whether the information is coherent across pages
- Saturation: Detecting when new pages aren't adding new information
When these metrics indicate sufficient information has been gathered, crawling stops automatically.
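The exact scoring logic lives in the adaptive strategies, but as a rough mental model (not Crawl4AI's actual formula, and with made-up weights) the stopping decision can be pictured as collapsing the three metrics into a single confidence value and comparing it against a threshold:

```python
# Illustrative sketch only: a simplified stand-in for the stopping decision,
# not Crawl4AI's real scoring code. The weights below are assumptions.
def should_stop_crawling(coverage: float, consistency: float,
                         saturation: float, threshold: float = 0.8) -> bool:
    """Combine the three metrics (each in [0, 1]) into one confidence score."""
    confidence = 0.5 * coverage + 0.3 * consistency + 0.2 * saturation
    return confidence >= threshold

# Good coverage, coherent pages, and little new information -> stop crawling
print(should_stop_crawling(coverage=0.9, consistency=0.85, saturation=0.8))  # True
```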
## Quick Start

### Basic Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")

if __name__ == "__main__":
    asyncio.run(main())
```
### Configuration Options
```python
from crawl4ai import AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.7,   # Stop when 70% confident (default: 0.8)
    max_pages=20,               # Maximum pages to crawl (default: 50)
    top_k_links=3,              # Links to follow per page (default: 5)
    min_gain_threshold=0.05     # Minimum expected gain to continue (default: 0.1)
)

adaptive = AdaptiveCrawler(crawler, config=config)
```
## Crawling Strategies

Adaptive Crawling supports two distinct strategies for determining information sufficiency:

### Statistical Strategy (Default)
The statistical strategy uses pure information theory and term-based analysis:
- Fast and efficient - No API calls or model loading
- Term-based coverage - Analyzes query term presence and distribution
- No external dependencies - Works offline
- Best for: Well-defined queries with specific terminology
```python
# Default configuration uses statistical strategy
config = AdaptiveConfig(
    strategy="statistical",  # This is the default
    confidence_threshold=0.8
)
```
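To make "term-based coverage" concrete, here is a small self-contained sketch of the underlying idea: measure what fraction of the query terms appear somewhere in the pages collected so far. This is an illustration of the concept only, not Crawl4AI's actual implementation.

```python
# Concept illustration of term-based coverage (not the library's implementation).
def term_coverage(query: str, pages: list[str]) -> float:
    """Fraction of query terms found in at least one collected page."""
    terms = {t.lower() for t in query.split()}
    corpus = " ".join(pages).lower()
    covered = sum(1 for term in terms if term in corpus)
    return covered / len(terms) if terms else 0.0

pages = [
    "Async context managers define __aenter__ and __aexit__ ...",
    "The async with statement drives the context manager protocol ...",
]
print(term_coverage("async context managers", pages))  # 1.0 -> every term is covered
```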
### Embedding Strategy
The embedding strategy uses semantic embeddings for deeper understanding:
- Semantic understanding - Captures meaning beyond exact term matches
- Query expansion - Automatically generates query variations
- Gap-driven selection - Identifies semantic gaps in knowledge
- Validation-based stopping - Uses held-out queries to validate coverage
- Best for: Complex queries, ambiguous topics, conceptual understanding
```python
# Configure embedding strategy
config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,                   # Generate 10 query variations
    embedding_min_confidence_threshold=0.1   # Stop if completely irrelevant
)

# With custom embedding provider (e.g., OpenAI)
config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    }
)
```
### Strategy Comparison

| Feature | Statistical | Embedding |
|---|---|---|
| Speed | Very fast | Moderate (API calls) |
| Cost | Free | Depends on provider |
| Accuracy | Good for exact terms | Excellent for concepts |
| Dependencies | None | Embedding model/API |
| Query Understanding | Literal | Semantic |
| Best Use Case | Technical docs, specific terms | Research, broad topics |
### Embedding Strategy Configuration
The embedding strategy offers fine-tuned control through several parameters:
```python
config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings

    # Query expansion
    n_query_variations=10,      # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,            # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,      # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,  # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,   # Min displayed confidence
    embedding_quality_max_confidence=0.95   # Max displayed confidence
)
```
### Handling Irrelevant Queries
The embedding strategy can detect when a query is completely unrelated to the content:
```python
# This will stop quickly with low confidence
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
)

# Check if query was irrelevant
if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")
```
## When to Use Adaptive Crawling

### Perfect For:
- Research Tasks: Finding comprehensive information about a topic
- Question Answering: Gathering sufficient context to answer specific queries
- Knowledge Base Building: Creating focused datasets for AI/ML applications
- Competitive Intelligence: Collecting complete information about specific products/features
### Not Recommended For:
- Full Site Archiving: When you need every page regardless of content
- Structured Data Extraction: When targeting specific, known page patterns
- Real-time Monitoring: When you need continuous updates
## Understanding the Output

### Confidence Score
The confidence score (0-1) indicates how sufficient the gathered information is:

- 0.0-0.3: Insufficient information, needs more crawling
- 0.3-0.6: Partial information, may answer basic queries
- 0.6-0.8: Good coverage, can answer most queries
- 0.8-1.0: Excellent coverage, comprehensive information
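For example, you can read the score back after a crawl and branch on these bands. This snippet assumes the crawler exposes its current score as `adaptive.confidence`; check the AdaptiveCrawler API Reference for the exact attribute name.

```python
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="async context managers"
)

score = adaptive.confidence  # assumed attribute; see the API reference
if score < 0.3:
    print(f"Insufficient information ({score:.2f}) - refine the query or raise max_pages")
elif score < 0.6:
    print(f"Partial information ({score:.2f}) - may answer basic queries")
elif score < 0.8:
    print(f"Good coverage ({score:.2f}) - can answer most queries")
else:
    print(f"Excellent coverage ({score:.2f}) - comprehensive information")
```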
### Statistics Display
```python
adaptive.print_stats(detailed=False)  # Summary table
adaptive.print_stats(detailed=True)   # Detailed metrics
```
The summary shows:

- Pages crawled vs. confidence achieved
- Coverage, consistency, and saturation scores
- Crawling efficiency metrics
## Persistence and Resumption

### Saving Progress
```python
config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)

# Crawl will auto-save progress
result = await adaptive.digest(start_url, query)
```
### Resuming a Crawl
```python
# Resume from saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)
```
### Exporting Knowledge Base
```python
# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")

# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("knowledge_base.jsonl")
```
## Best Practices

### 1. Query Formulation
- Use specific, descriptive queries
- Include key terms you expect to find
- Avoid overly broad queries
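For instance, a query that names the concrete terms you expect to find usually reaches the confidence threshold sooner than a one-word query; the start URL below is just a placeholder:

```python
# Specific, descriptive query: names the exact concepts the pages should contain
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="asyncio event loop task cancellation"
)

# Overly broad query: almost every page looks relevant, so crawling converges slowly
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="python"
)
```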
### 2. Threshold Tuning
- Start with default (0.8) for general use
- Lower to 0.6-0.7 for exploratory crawling
- Raise to 0.9+ for exhaustive coverage
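As a sketch, the same crawl can be tuned for different goals just by adjusting `confidence_threshold`:

```python
from crawl4ai import AdaptiveConfig

exploratory = AdaptiveConfig(confidence_threshold=0.65)  # quick overview, fewer pages
general = AdaptiveConfig(confidence_threshold=0.8)       # the default
exhaustive = AdaptiveConfig(confidence_threshold=0.95)   # keep crawling for near-complete coverage
```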
### 3. Performance Optimization

- Use appropriate `max_pages` limits
- Adjust `top_k_links` based on site structure
- Enable caching for repeat crawls
### 4. Link Selection

- The crawler prioritizes links based on:
  - Relevance to query
  - Expected information gain
  - URL structure and depth
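As a purely illustrative sketch (not the crawler's internal ranking, and with invented weights), a link score combining those three factors might look like this:

```python
# Purely illustrative link scoring: the weights and depth penalty are made up.
def score_link(relevance: float, expected_gain: float, url: str) -> float:
    depth = max(url.rstrip("/").count("/") - 2, 0)  # rough path depth of the URL
    depth_penalty = 1.0 / (1.0 + depth)             # deeper links score slightly lower
    return (0.6 * relevance + 0.4 * expected_gain) * depth_penalty

candidates = {
    "https://docs.python.org/3/reference/datamodel.html": (0.9, 0.7),
    "https://docs.python.org/3/faq/": (0.3, 0.2),
}
ranked = sorted(candidates, key=lambda u: score_link(*candidates[u], u), reverse=True)
print(ranked[0])  # the more relevant, higher-gain link is followed first
```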
## Examples

### Research Assistant
```python
# Gather information about a programming concept
result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
)

# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")
```
### Knowledge Base Builder
```python
# Build a focused knowledge base about machine learning
queries = [
    "supervised learning algorithms",
    "neural network architectures",
    "model evaluation metrics"
]

for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )

# Export combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")
```
### API Documentation Crawler
```python
# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```
## Next Steps
- Learn about Advanced Adaptive Strategies
- Explore the AdaptiveCrawler API Reference
- See more Examples
## FAQ
Q: How is this different from traditional crawling?
A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop, based on information gain.

Q: Can I use this with JavaScript-heavy sites?
A: Yes! AdaptiveCrawler inherits all capabilities from AsyncWebCrawler, including JavaScript execution.
Q: How does it handle large websites?
A: The algorithm naturally limits crawling to relevant sections. Use `max_pages` as a safety limit.
Q: Can I customize the scoring algorithms?
A: Advanced users can implement custom strategies. See Adaptive Strategies.