Crawl4AI 0.4.3: Major Performance Boost & LLM Integration

We're excited to announce Crawl4AI 0.4.3, focused on three areas: Speed & Efficiency, LLM Integration, and Core Platform Improvements. This release makes large crawls faster and lighter on memory while adding LLM-powered content filtering and schema generation.

⚑ Speed & Efficiency Improvements

1. Memory-Adaptive Dispatcher System

The new dispatcher system provides intelligent resource management and real-time monitoring:

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, CrawlerMonitor

async def main():
    urls = ["https://example1.com", "https://example2.com"] * 50

    # Configure memory-aware dispatch
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # Auto-throttle at 80% memory usage
        check_interval=0.5,             # Check memory every 0.5 seconds
        max_session_permit=20,          # Max concurrent crawl sessions
        monitor=CrawlerMonitor(         # Real-time progress monitoring
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler() as crawler:
        results = await dispatcher.run_urls(
            urls=urls,
            crawler=crawler,
            config=CrawlerRunConfig()
        )
        print(f"Crawled {len(results)} pages")

asyncio.run(main())

2. Streaming Support

Process results as they arrive instead of waiting for the entire batch to finish:

config = CrawlerRunConfig(stream=True)

async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun_many(urls, config=config):
        print(f"Got result for {result.url}")
        # Process each result immediately
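
With stream=False (the default), arun_many instead collects every result and returns a complete list, which is simpler but keeps all results in memory. As in the snippet above, urls is your list of URLs:

config = CrawlerRunConfig(stream=False)  # default behavior

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls, config=config)  # list of all results
    for result in results:
        print(f"Got result for {result.url}")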

3. LXML-Based Scraping

New LXML scraping strategy offering up to 20x faster parsing:

from crawl4ai import CacheMode
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy(),  # lxml-based parser
    cache_mode=CacheMode.ENABLED                  # reuse cached responses
)

πŸ€– LLM Integration

1. LLM-Powered Markdown Generation

Smart content filtering and organization using LLMs:

from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=LLMContentFilter(
            provider="openai/gpt-4o",  # LiteLLM-style provider string
            instruction="Extract technical documentation and code examples"
        )
    )
)
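
To read the filtered output after a crawl: in the 0.4.x line the generated markdown lands on result.markdown_v2, with fit_markdown holding the filtered text (in later releases this moved to result.markdown). A minimal sketch, assuming the config above and a placeholder URL:

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com", config=config)
    md = result.markdown_v2       # result.markdown in newer releases
    print(md.raw_markdown[:300])  # full page as markdown
    print(md.fit_markdown[:300])  # LLM-filtered content only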

2. Automatic Schema Generation

Use an LLM to generate extraction schemas automatically instead of writing CSS/XPath selectors by hand:

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = JsonCssExtractionStrategy.generate_schema(
    html_content,  # sample HTML from the target page
    schema_type="CSS",
    query="Extract product name, price, and description"
)
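
The generated schema is a plain dict, so it can be passed straight to JsonCssExtractionStrategy and reused across crawls without further LLM calls. A minimal sketch (the product URL is a placeholder):

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

config = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)  # reuse generated schema
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://shop.example.com/products", config=config)
    print(result.extracted_content)  # JSON string with the requested fields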

πŸ”§ Core Improvements

1. Proxy Support & Rotation

Integrated proxy support with automatic rotation and verification (a simple manual rotation pattern is sketched after the example):

config = CrawlerRunConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    }
)
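
For rotation across a proxy pool, one straightforward pattern is to cycle through several proxy_config dicts yourself, one per request. A minimal sketch with hypothetical proxy endpoints:

import itertools
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Hypothetical pool; substitute your own proxies
proxy_pool = itertools.cycle([
    {"server": "http://proxy1:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2:8080", "username": "user", "password": "pass"},
])

async with AsyncWebCrawler() as crawler:
    for url in urls:
        config = CrawlerRunConfig(proxy_config=next(proxy_pool))  # next proxy each request
        result = await crawler.arun(url, config=config)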

2. Robots.txt Compliance

Built-in robots.txt support with SQLite caching:

config = CrawlerRunConfig(check_robots_txt=True)
result = await crawler.arun(url, config=config)
if result.status_code == 403:
    print("Access blocked by robots.txt")

3. URL Redirection Tracking

Track final URLs after redirects:

result = await crawler.arun(url)
print(f"Initial URL: {url}")
print(f"Final URL: {result.redirected_url}")

Performance Impact

  • Memory usage reduced by up to 40% with the adaptive dispatcher
  • Parsing speed increased by up to 20x with the LXML strategy
  • Streaming reduces the memory footprint of large crawls by ~60%

Getting Started

pip install -U crawl4ai
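
On a fresh install, also run the one-time setup command, which installs the browser dependencies Crawl4AI needs:

crawl4ai-setup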

For complete examples, check our demo repository.

Happy crawling! πŸ•·οΈ

