Adaptive Web Crawling

Introduction

Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. Adaptive Crawling changes this paradigm by introducing intelligence into the crawling process.

Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.

Key Concepts

The Problem It Solves

When crawling websites for specific information, you face two challenges:

1. Under-crawling: Stopping too early and missing crucial information
2. Over-crawling: Wasting resources by crawling irrelevant pages

Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.

How It Works

The AdaptiveCrawler uses three metrics to measure information sufficiency:

  • Coverage: How well your collected pages cover the query terms
  • Consistency: Whether the information is coherent across pages
  • Saturation: Detecting when new pages aren't adding new information

When these metrics indicate sufficient information has been gathered, crawling stops automatically.
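
To make the saturation signal concrete, here is a standalone sketch (illustrative only, not AdaptiveCrawler's internal implementation) that tracks how many previously unseen terms each new page contributes and reports saturation once that rate falls below a small threshold:

# Illustrative saturation check: stop when new pages stop adding new terms.
# Simplified stand-in for the real three-metric scoring inside AdaptiveCrawler.
def is_saturated(pages: list[str], threshold: float = 0.05) -> bool:
    seen: set[str] = set()
    last_gain = 1.0
    for text in pages:
        terms = set(text.lower().split())
        new_terms = terms - seen
        last_gain = len(new_terms) / max(len(terms), 1)
        seen |= terms
    return last_gain < threshold

print(is_saturated([
    "async context managers",
    "async context managers explained",
    "async context managers",
]))  # True: the last page added no new terms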

Quick Start

Basic Usage

import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")

Configuration Options

from crawl4ai import AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.7,    # Stop when 70% confident (default: 0.8)
    max_pages=20,               # Maximum pages to crawl (default: 50)
    top_k_links=3,              # Links to follow per page (default: 5)
    min_gain_threshold=0.05     # Minimum expected gain to continue (default: 0.1)
)

adaptive = AdaptiveCrawler(crawler, config=config)
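
For completeness, here is the same config plugged into a full run (a minimal sketch that reuses the Quick Start URL and query as placeholders):

import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    config = AdaptiveConfig(confidence_threshold=0.7, max_pages=20)
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config=config)
        # Stops at 70% confidence or after 20 pages, whichever comes first
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )
        adaptive.print_stats()

asyncio.run(main())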

Crawling Strategies

Adaptive Crawling supports two distinct strategies for determining information sufficiency:

Statistical Strategy (Default)

The statistical strategy uses pure information theory and term-based analysis:

  • Fast and efficient - No API calls or model loading
  • Term-based coverage - Analyzes query term presence and distribution
  • No external dependencies - Works offline
  • Best for: Well-defined queries with specific terminology

# Default configuration uses statistical strategy
config = AdaptiveConfig(
    strategy="statistical",  # This is the default
    confidence_threshold=0.8
)

Embedding Strategy

The embedding strategy uses semantic embeddings for deeper understanding:

  • Semantic understanding - Captures meaning beyond exact term matches
  • Query expansion - Automatically generates query variations
  • Gap-driven selection - Identifies semantic gaps in knowledge
  • Validation-based stopping - Uses held-out queries to validate coverage
  • Best for: Complex queries, ambiguous topics, conceptual understanding

# Configure embedding strategy
config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,  # Generate 10 query variations
    embedding_min_confidence_threshold=0.1  # Stop if completely irrelevant
)

# With custom embedding provider (e.g., OpenAI)
config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    }
)

Strategy Comparison

| Feature | Statistical | Embedding |
|---|---|---|
| Speed | Very fast | Moderate (API calls) |
| Cost | Free | Depends on provider |
| Accuracy | Good for exact terms | Excellent for concepts |
| Dependencies | None | Embedding model/API |
| Query Understanding | Literal | Semantic |
| Best Use Case | Technical docs, specific terms | Research, broad topics |
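
If you switch between the two strategies depending on the query, a small helper like the one below can encode that choice (the word-count heuristic is purely hypothetical, not something the library provides; only strategy, confidence_threshold, and n_query_variations come from the documented options):

from crawl4ai import AdaptiveConfig

def choose_config(query: str) -> AdaptiveConfig:
    # Hypothetical heuristic: short, terminology-heavy queries suit the
    # statistical strategy; longer conceptual questions benefit from
    # semantic matching via the embedding strategy.
    if len(query.split()) <= 4:
        return AdaptiveConfig(strategy="statistical", confidence_threshold=0.8)
    return AdaptiveConfig(strategy="embedding", n_query_variations=10)

config = choose_config("how do python async context managers work internally")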

Embedding Strategy Configuration

The embedding strategy offers fine-tuned control through several parameters:

config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings

    # Query expansion
    n_query_variations=10,  # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,  # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,  # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,  # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,  # Min displayed confidence
    embedding_quality_max_confidence=0.95  # Max displayed confidence
)

Handling Irrelevant Queries

The embedding strategy can detect when a query is completely unrelated to the content:

# This will stop quickly with low confidence
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
)

# Check if query was irrelevant
if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")

When to Use Adaptive Crawling

Perfect For:

  • Research Tasks: Finding comprehensive information about a topic
  • Question Answering: Gathering sufficient context to answer specific queries
  • Knowledge Base Building: Creating focused datasets for AI/ML applications
  • Competitive Intelligence: Collecting complete information about specific products/features

Not Ideal For:

  • Full Site Archiving: When you need every page regardless of content
  • Structured Data Extraction: When targeting specific, known page patterns
  • Real-time Monitoring: When you need continuous updates

Understanding the Output

Confidence Score

The confidence score (0-1) indicates how sufficient the gathered information is:

  • 0.0-0.3: Insufficient information, needs more crawling
  • 0.3-0.6: Partial information, may answer basic queries
  • 0.6-0.8: Good coverage, can answer most queries
  • 0.8-1.0: Excellent coverage, comprehensive information
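
A small helper (purely illustrative; the bands above are the only assumption) shows how you might act on the score in your own code:

def interpret_confidence(score: float) -> str:
    # Maps a 0-1 confidence score to the bands described above.
    if score < 0.3:
        return "insufficient - keep crawling"
    if score < 0.6:
        return "partial - basic queries only"
    if score < 0.8:
        return "good - most queries answerable"
    return "excellent - comprehensive coverage"

print(interpret_confidence(0.72))  # good - most queries answerable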

Statistics Display

adaptive.print_stats(detailed=False)  # Summary table
adaptive.print_stats(detailed=True)   # Detailed metrics

The summary shows:

  • Pages crawled vs. confidence achieved
  • Coverage, consistency, and saturation scores
  • Crawling efficiency metrics

Persistence and Resumption

Saving Progress

config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)

# Crawl will auto-save progress
result = await adaptive.digest(start_url, query)

Resuming a Crawl

# Resume from saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)

Exporting Knowledge Base

# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")

# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("knowledge_base.jsonl")

Best Practices

1. Query Formulation

  • Use specific, descriptive queries
  • Include key terms you expect to find
  • Avoid overly broad queries

2. Threshold Tuning

  • Start with default (0.8) for general use
  • Lower to 0.6-0.7 for exploratory crawling
  • Raise to 0.9+ for exhaustive coverage
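
The guidance above maps to configs roughly like these (a sketch; only confidence_threshold and max_pages are documented options, and the exact numbers are illustrative):

from crawl4ai import AdaptiveConfig

# Illustrative presets for the three tuning bands described above
exploratory = AdaptiveConfig(confidence_threshold=0.65, max_pages=15)
general_use = AdaptiveConfig(confidence_threshold=0.8)               # library default
exhaustive = AdaptiveConfig(confidence_threshold=0.95, max_pages=100)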

3. Performance Optimization

  • Use appropriate max_pages limits
  • Adjust top_k_links based on site structure
  • Enable caching for repeat crawls

4. Link Selection

The crawler prioritizes links based on:

  • Relevance to query
  • Expected information gain
  • URL structure and depth
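
As a purely conceptual illustration of how these signals might combine (the weights and function below are not the library's actual scorer):

def link_priority(relevance: float, expected_gain: float, depth: int) -> float:
    # Illustrative weighting only: favor relevant, informative links,
    # and penalize links that sit deep in the URL hierarchy.
    return 0.5 * relevance + 0.4 * expected_gain - 0.1 * depth

print(link_priority(relevance=0.9, expected_gain=0.6, depth=2))  # 0.49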

Examples

Research Assistant

# Gather information about a programming concept
result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
)

# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")

Knowledge Base Builder

# Build a focused knowledge base about machine learning
queries = [
    "supervised learning algorithms",
    "neural network architectures", 
    "model evaluation metrics"
]

for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )

# Export combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")

API Documentation Crawler

# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)

FAQ

Q: How is this different from traditional crawling? A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop based on information gain.

Q: Can I use this with JavaScript-heavy sites? A: Yes! AdaptiveCrawler inherits all capabilities from AsyncWebCrawler, including JavaScript execution.

Q: How does it handle large websites? A: The algorithm naturally limits crawling to relevant sections. Use max_pages as a safety limit.

Q: Can I customize the scoring algorithms? A: Advanced users can implement custom strategies. See Adaptive Strategies.

