Anti-Bot Detection & Fallback

When crawling sites protected by anti-bot systems (Akamai, Cloudflare, PerimeterX, DataDome, Imperva, etc.), requests often get blocked with CAPTCHAs, 403 responses, or empty pages. Crawl4AI provides a layered retry and fallback system that automatically detects blocking and escalates through multiple strategies until content is retrieved.

How Detection Works

After each crawl attempt, Crawl4AI inspects the HTTP status code and HTML content for known anti-bot signals:

  • HTTP 403/429 with short or empty response bodies
  • Challenge pages: Cloudflare "Just a moment", Akamai "Access Denied", PerimeterX block pages
  • CAPTCHA injection: reCAPTCHA, hCaptcha, or vendor-specific challenges on otherwise empty pages
  • Firewall blocks: Imperva/Incapsula resource iframes, Sucuri firewall pages, Cloudflare error codes

Detection uses structural HTML markers (specific element IDs, script sources, form actions) rather than generic keywords to minimize false positives. A normal page that happens to mention "CAPTCHA" or "Cloudflare" in its content will not be flagged.
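
As a rough illustration of the structural approach, a detector might look like the sketch below. The marker strings are hypothetical examples of real-world challenge-page fingerprints, not Crawl4AI's actual internal list.

def looks_blocked(status_code: int, html: str) -> bool:
    """Illustrative sketch: flag structural anti-bot markers, not keywords."""
    # Hypothetical markers: challenge pages embed identifiable element IDs,
    # script paths, and iframe sources rather than just the word "CAPTCHA".
    structural_markers = (
        "cdn-cgi/challenge-platform",   # Cloudflare challenge script path
        'id="challenge-form"',          # Cloudflare challenge form
        "_Incapsula_Resource",          # Imperva/Incapsula resource iframe
    )
    if status_code in (403, 429) and len(html.strip()) < 512:
        return True
    return any(marker in html for marker in structural_markers)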

When all attempts fail and blocking is still detected, the result is returned with success=False and error_message describing the block reason.

Configuration Options

All anti-bot retry options live on CrawlerRunConfig:

  • proxy_config (ProxyConfig, list[ProxyConfig], or None; default None): single proxy or ordered list of proxies to try. Each retry round iterates through the full list. Use "direct" or ProxyConfig.DIRECT in a list to explicitly try without a proxy.
  • max_retries (int; default 0): number of retry rounds when blocking is detected. 0 = no retries.
  • fallback_fetch_function (async (str) -> str; default None): async function called as a last resort. Takes the URL, returns raw HTML.
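
For orientation, here is a minimal sketch that wires all three options together (the proxy URL and fetcher are placeholders):

from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig

async def fetch_html(url: str) -> str:
    # Placeholder last-resort fetcher; swap in any source of raw HTML
    raise NotImplementedError

config = CrawlerRunConfig(
    max_retries=2,                       # up to 2 extra rounds after the first
    proxy_config=[
        ProxyConfig.DIRECT,              # try without a proxy first
        ProxyConfig(server="http://proxy.example.com:8080"),
    ],
    fallback_fetch_function=fetch_html,  # called only if every round is blocked
)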

Escalation Chain

Each retry round tries every proxy in proxy_config in order. If all rounds are exhausted and the page is still blocked, the fallback fetch function is called as a last resort.

For each round (1 + max_retries rounds):
    1. Try proxy_config[0] (or direct if proxy_config is None)
    2. If blocked → try proxy_config[1]
    3. If blocked → try proxy_config[2]
    4. ... continue through all proxies
    5. If any attempt succeeds → done

If all rounds are exhausted and still blocked:
    6. Call fallback_fetch_function(url) → process returned HTML

Worst-case number of browser attempts before the fallback fetch function runs: (1 + max_retries) x len(proxy_config)
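
To make the math concrete, this standalone helper (not library code) enumerates the (round, proxy) pairs in the order they would be tried:

def attempt_order(proxies, max_retries):
    """List (round, proxy) pairs in the order the escalation chain tries them."""
    steps = proxies or ["direct"]
    return [(r, p) for r in range(1, max_retries + 2) for p in steps]

# Two proxies with max_retries=1 -> (1 + 1) x 2 = 4 attempts
print(attempt_order(["datacenter", "residential"], max_retries=1))
# [(1, 'datacenter'), (1, 'residential'), (2, 'datacenter'), (2, 'residential')]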

Crawl Stats

Every crawl result includes a crawl_stats dict with detailed attempt tracking:

result.crawl_stats = {
    "attempts": 3,                    # total browser attempts made
    "retries": 1,                     # retry rounds used (0 = succeeded first round)
    "proxies_used": [                 # ordered list of every attempt
        {"proxy": None,               "status_code": 403, "blocked": True,  "reason": "Akamai block (Reference #)"},
        {"proxy": "proxy.io:8080",    "status_code": 403, "blocked": True,  "reason": "Akamai block (Reference #)"},
        {"proxy": "premium.io:9090",  "status_code": 200, "blocked": False, "reason": ""},
    ],
    "fallback_fetch_used": False,     # whether fallback_fetch_function was called
    "resolved_by": "proxy",           # "direct" | "proxy" | "fallback_fetch" | null (all failed)
}
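
For example, to log how a crawl resolved after the fact (assuming result is a finished crawl result), you can walk the attempt list:

stats = result.crawl_stats
print(f"Resolved by {stats['resolved_by']} after {stats['attempts']} attempts")
for i, attempt in enumerate(stats["proxies_used"], start=1):
    via = attempt["proxy"] or "direct"
    outcome = attempt["reason"] if attempt["blocked"] else "OK"
    print(f"  {i}. {via}: HTTP {attempt['status_code']} -> {outcome}")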

Usage Examples

Simple Retry (No Proxy)

Retry the crawl up to 3 times when blocking is detected. Useful when blocks are intermittent and a fresh attempt from the same IP may get through.

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=CrawlerRunConfig(max_retries=3),
    )

Single Proxy

Pass a single ProxyConfig and it is used on every attempt, matching the long-standing single-proxy behavior.

from crawl4ai.async_configs import ProxyConfig

config = CrawlerRunConfig(
    max_retries=2,
    proxy_config=ProxyConfig(
        server="http://proxy.example.com:8080",
        username="user",
        password="pass",
    ),
)

Direct-First, Then Proxies

Try without a proxy first, then escalate to proxies if blocked. Use ProxyConfig.DIRECT (or the string "direct") in the list to represent a no-proxy attempt.

config = CrawlerRunConfig(
    max_retries=1,
    proxy_config=[
        ProxyConfig.DIRECT,  # Try without proxy first
        ProxyConfig(
            server="http://datacenter-proxy.example.com:8080",
            username="user",
            password="pass",
        ),
        ProxyConfig(
            server="http://residential-proxy.example.com:9090",
            username="user",
            password="pass",
        ),
    ],
)

With this setup, each round tries direct first, then datacenter, then residential. With max_retries=1, worst case is 2 rounds x 3 steps = 6 attempts.

Proxy List (Escalation)

Pass a list of proxies. They're tried in order: the first one that works wins. Within each retry round, the entire list is tried again.

config = CrawlerRunConfig(
    max_retries=1,
    proxy_config=[
        ProxyConfig(
            server="http://datacenter-proxy.example.com:8080",
            username="user",
            password="pass",
        ),
        ProxyConfig(
            server="http://residential-proxy.example.com:9090",
            username="user",
            password="pass",
        ),
    ],
)

With this setup, each round tries the datacenter proxy first, then the residential proxy. With max_retries=1, worst case is 2 rounds x 2 proxies = 4 attempts.

Fallback Fetch Function

When all browser-based attempts fail, call a custom async function as a last resort. This function receives the URL and must return raw HTML as a string. The returned HTML is processed through the normal pipeline (markdown generation, extraction, etc.).

This is useful when you have access to a scraping API, a pre-fetched cache, or any other source of HTML.

import aiohttp

async def my_scraping_api(url: str) -> str:
    """Fetch HTML via an external scraping API."""
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://api.my-scraping-service.com/fetch",
            params={"url": url, "format": "html"},
            headers={"Authorization": "Bearer MY_TOKEN"},
        ) as resp:
            if resp.status == 200:
                return await resp.text()
            raise RuntimeError(f"API error: {resp.status}")

config = CrawlerRunConfig(
    max_retries=1,
    fallback_fetch_function=my_scraping_api,
)

The function can do anything: call an API, read from a database, return cached HTML, or make a simple HTTP request with a different library. Crawl4AI does not care how the HTML is obtained.
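
For instance, a fallback that serves pre-fetched HTML from a local cache directory might look like this sketch (the cache layout is hypothetical):

import hashlib
from pathlib import Path

from crawl4ai.async_configs import CrawlerRunConfig

CACHE_DIR = Path("./html_cache")  # hypothetical directory of pre-fetched pages

async def cached_fallback(url: str) -> str:
    """Return cached HTML for a URL, keyed by a hash of the URL."""
    path = CACHE_DIR / f"{hashlib.sha256(url.encode()).hexdigest()}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    raise FileNotFoundError(f"No cached HTML for {url}")

config = CrawlerRunConfig(max_retries=0, fallback_fetch_function=cached_fallback)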

Full Escalation (All Features Combined)

This example combines every layer: stealth mode, a list of proxies tried in order, retries, and a final fetch function.

import aiohttp
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, ProxyConfig

# Last-resort: fetch HTML via an external service
async def external_fetch(url: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.my-service.com/scrape",
            json={"url": url, "render_js": True},
            headers={"Authorization": "Bearer MY_TOKEN"},
        ) as resp:
            if resp.status != 200:
                raise RuntimeError(f"Scrape API error: {resp.status}")
            return await resp.text()

browser_config = BrowserConfig(
    headless=True,
    enable_stealth=True,
)

crawl_config = CrawlerRunConfig(
    magic=True,
    wait_until="load",
    max_retries=2,

    # Proxies tried in order: cheapest first
    proxy_config=[
        ProxyConfig(
            server="http://datacenter-proxy.example.com:8080",
            username="user",
            password="pass",
        ),
        ProxyConfig(
            server="http://residential-proxy.example.com:9090",
            username="user",
            password="pass",
        ),
    ],

    # Last resort: called after all retries and proxies are exhausted
    fallback_fetch_function=external_fetch,
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://protected-site.com/products",
        config=crawl_config,
    )

    if result.success:
        print(f"Got {len(result.markdown.raw_markdown)} chars of markdown")
        print(f"Resolved by: {result.crawl_stats['resolved_by']}")
        print(f"Attempts: {result.crawl_stats['attempts']}")
    else:
        print(f"All attempts failed: {result.error_message}")

What happens step by step when every browser attempt is blocked:

Round  Attempt  What runs
1      1        Datacenter proxy (blocked)
1      2        Residential proxy (blocked)
2      1        Datacenter proxy (blocked)
2      2        Residential proxy (blocked)
3      1        Datacenter proxy (blocked)
3      2        Residential proxy (blocked)
-      -        external_fetch(url) called, returns HTML

That's up to 6 browser attempts + 1 function call before giving up.

Tips

  • Start with max_retries=0 and a fallback_fetch_function if you just want a safety net without burning time on retries.
  • Order proxies cheapest-first: datacenter proxies before residential, residential before premium.
  • Combine with stealth mode: BrowserConfig(enable_stealth=True) and CrawlerRunConfig(magic=True) reduce the chance of being blocked in the first place.
  • wait_until="load" is important for anti-bot sites: the default domcontentloaded can return before the anti-bot sensor finishes.
  • Check crawl_stats to understand what happened: how many attempts, which proxy worked, whether the fallback function was needed.
