Crawl4AI v0.7.5: The Docker Hooks & Security Update
September 29, 2025 • 8 min read
Today I'm releasing Crawl4AI v0.7.5, focused on extensibility and security. This update introduces the Docker Hooks System for pipeline customization, enhanced LLM integration, and important security improvements.
What's New at a Glance
- Docker Hooks System: Custom Python functions at key pipeline points with function-based API
- Function-Based Hooks: New hooks_to_string() utility with Docker client auto-conversion
- Enhanced LLM Integration: Custom providers with temperature control
- HTTPS Preservation: Secure internal link handling
- Bug Fixes: Resolved multiple community-reported issues
- Improved Docker Error Handling: Better debugging and reliability
Docker Hooks System: Pipeline Customization
Every scraping project needs custom logic: authentication, performance optimization, content processing. Traditional solutions require forking the library or building complex workarounds. Docker Hooks let you inject custom Python functions at 8 key points in the crawling pipeline.
Real Example: Authentication & Performance
import requests

# Real working hooks for httpbin.org
hooks_config = {
    "on_page_context_created": """
async def hook(page, context, **kwargs):
    print("Hook: Setting up page context")
    # Block images to speed up crawling
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    print("Hook: Images blocked")
    return page
""",
    "before_retrieve_html": """
async def hook(page, context, **kwargs):
    print("Hook: Before retrieving HTML")
    # Scroll to bottom to load lazy content
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    print("Hook: Scrolled to bottom")
    return page
""",
    "before_goto": """
async def hook(page, context, url, **kwargs):
    print(f"Hook: About to navigate to {url}")
    # Add custom headers
    await page.set_extra_http_headers({
        'X-Test-Header': 'crawl4ai-hooks-test'
    })
    return page
"""
}

# Test with the Docker API
payload = {
    "urls": ["https://httpbin.org/html"],
    "hooks": {
        "code": hooks_config,
        "timeout": 30
    }
}
response = requests.post("http://localhost:11235/crawl", json=payload)
result = response.json()
if result.get('success'):
    print("✅ Hooks executed successfully!")
    print(f"Content length: {len(result.get('markdown', ''))} characters")
Available Hook Points:
- on_browser_created: Browser setup
- on_page_context_created: Page context configuration
- before_goto: Pre-navigation setup
- after_goto: Post-navigation processing
- on_user_agent_updated: User agent changes
- on_execution_started: Crawl initialization
- before_retrieve_html: Pre-extraction processing
- before_return_html: Final HTML processing
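Each of these accepts the same string format shown above. As one more illustration, here is a minimal sketch of an after_goto hook that lets the page settle after navigation; the exact hook signature can vary by hook point, so **kwargs absorbs any extra arguments, and the networkidle wait is ordinary Playwright, not anything Crawl4AI-specific:

```python
# Sketch: an "after_goto" hook that waits for the network to go idle
# before any post-navigation processing. Tune the wait to your pages.
hooks_config = {
    "after_goto": """
async def hook(page, context, url, response=None, **kwargs):
    print(f"Hook: Arrived at {url}")
    await page.wait_for_load_state("networkidle")
    return page
"""
}
```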
Function-Based Hooks API
Writing hooks as strings works, but lacks IDE support and type checking. v0.7.5 introduces a function-based approach with automatic conversion!
Option 1: Using the hooks_to_string() Utility
from crawl4ai import hooks_to_string
import requests

# Define hooks as regular Python functions (with full IDE support!)
async def on_page_context_created(page, context, **kwargs):
    """Block images to speed up crawling"""
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    await page.set_viewport_size({"width": 1920, "height": 1080})
    return page

async def before_goto(page, context, url, **kwargs):
    """Add custom headers"""
    await page.set_extra_http_headers({
        'X-Crawl4AI': 'v0.7.5',
        'X-Custom-Header': 'my-value'
    })
    return page

# Convert functions to strings
hooks_code = hooks_to_string({
    "on_page_context_created": on_page_context_created,
    "before_goto": before_goto
})

# Use with the REST API
payload = {
    "urls": ["https://httpbin.org/html"],
    "hooks": {"code": hooks_code, "timeout": 30}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
Option 2: Docker Client with Automatic Conversion (Recommended!)
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient

# Define hooks as functions (same as above)
async def on_page_context_created(page, context, **kwargs):
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    return page

async def before_retrieve_html(page, context, **kwargs):
    # Scroll to load lazy content
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(1000)
    return page

async def main():
    # Use the Docker client - conversion happens automatically!
    client = Crawl4aiDockerClient(base_url="http://localhost:11235")
    results = await client.crawl(
        urls=["https://httpbin.org/html"],
        hooks={
            "on_page_context_created": on_page_context_created,
            "before_retrieve_html": before_retrieve_html
        },
        hooks_timeout=30
    )
    if results and results.success:
        print(f"✅ Hooks executed! HTML length: {len(results.html)}")

asyncio.run(main())
Benefits of Function-Based Hooks:
- ✅ Full IDE support (autocomplete, syntax highlighting)
- ✅ Type checking and linting
- ✅ Easier to test and debug (see the sketch below)
- ✅ Reusable across projects
- ✅ Automatic conversion in the Docker client
- ✅ No breaking changes: string hooks still work!
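Because function-based hooks are ordinary async functions, they can be unit-tested before ever touching the Docker server. A minimal sketch using unittest.mock, where the mocks stand in for Playwright's real page and context objects:

```python
import asyncio
from unittest.mock import AsyncMock

async def on_page_context_created(page, context, **kwargs):
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
    return page

async def test_hook():
    # Mock stand-ins for Playwright's page and context
    page, context = AsyncMock(), AsyncMock()
    result = await on_page_context_created(page, context)
    context.route.assert_awaited_once()  # the image-blocking route was registered
    assert result is page
    print("Hook behaves as expected")

asyncio.run(test_hook())
```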
Enhanced LLM Integration
v0.7.5 extends LLM integration with custom providers, temperature control, and base URL configuration.
Multi-Provider Support
import requests
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Test with different providers
async def test_llm_providers():
    # Gemini with custom temperature
    llm_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-2.5-flash-lite",
        api_token="your-api-token",
        temperature=0.7,  # New in v0.7.5
        instruction="Summarize this page in one sentence"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(extraction_strategy=llm_strategy)
        )
        if result.success:
            print("✅ LLM extraction completed")
            print(result.extracted_content)

# Docker API with enhanced LLM config
llm_payload = {
    "url": "https://example.com",
    "f": "llm",
    "q": "Summarize this page in one sentence.",
    "provider": "gemini/gemini-2.5-flash-lite",
    "temperature": 0.7
}
response = requests.post("http://localhost:11235/md", json=llm_payload)
New Features:
- Custom temperature parameter for creativity control
- base_url for custom API endpoints (see the sketch after this list)
- Multi-provider environment variable support
- Docker API integration
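A minimal sketch of the base_url option, assuming an OpenAI-compatible server and following the same parameter style as the example above; the endpoint URL and model name are placeholders, not shipped defaults:

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Sketch: route extraction through a self-hosted, OpenAI-compatible endpoint.
# The model name and URL below are hypothetical placeholders.
local_strategy = LLMExtractionStrategy(
    provider="openai/my-local-model",
    api_token="sk-placeholder",           # many local servers ignore the token
    base_url="http://localhost:8000/v1",  # new in v0.7.5: custom API endpoint
    temperature=0.2,
    instruction="Extract the page title and main topic"
)
```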
HTTPS Preservation
The Problem: Modern web apps require HTTPS everywhere. When crawlers downgrade internal links from HTTPS to HTTP, authentication breaks and security warnings appear.
Solution: HTTPS preservation maintains secure protocols throughout crawling.
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    FilterChain,
    URLPatternFilter,
    BFSDeepCrawlStrategy,
)

async def test_https_preservation():
    # Restrict the crawl to quotes.toscrape.com
    url_filter = URLPatternFilter(
        patterns=[r"^(https://)?quotes\.toscrape\.com(/.*)?$"]
    )
    config = CrawlerRunConfig(
        exclude_external_links=True,
        preserve_https_for_internal_links=True,  # New in v0.7.5
        stream=True,  # yield deep-crawl results as they complete
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,
            filter_chain=FilterChain([url_filter])
        )
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(
            url="https://quotes.toscrape.com",
            config=config
        ):
            # All internal links maintain HTTPS
            internal_links = [link['href'] for link in result.links['internal']]
            https_links = [link for link in internal_links if link.startswith('https://')]
            print(f"HTTPS links preserved: {len(https_links)}/{len(internal_links)}")
            for link in https_links[:3]:
                print(f"  → {link}")

asyncio.run(test_https_preservation())
Bug Fixes and Improvements
Major Fixes
- URL Processing: Fixed '+' sign preservation in query parameters (#1332); see the sketch after this list
- Proxy Configuration: Enhanced proxy string parsing (old proxy parameter deprecated)
- Docker Error Handling: Comprehensive error messages with status codes
- Memory Management: Fixed leaks in long-running sessions
- JWT Authentication: Fixed Docker JWT validation issues (#1442)
- Playwright Stealth: Fixed stealth features for Playwright integration (#1481)
- API Configuration: Fixed config handling to prevent overriding user-provided settings (#1505)
- Docker Filter Serialization: Resolved JSON encoding errors in deep crawl strategy (#1419)
- LLM Provider Support: Fixed custom LLM provider integration for adaptive crawler (#1291)
- Performance Issues: Resolved backoff strategy failures and timeout handling (#989)
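For the URL-processing fix (#1332), a quick way to check '+' preservation is to crawl a URL whose query string contains one and inspect the URL the crawler actually used. A minimal sketch:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # The '+' in the query string should survive URL processing intact
        result = await crawler.arun("https://httpbin.org/get?q=a+b")
        print(result.url)  # expect: https://httpbin.org/get?q=a+b

asyncio.run(main())
```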
Community-Reported Issues Fixed
This release addresses multiple issues reported by the community through GitHub issues and Discord discussions:
- Fixed browser configuration reference errors
- Resolved dependency conflicts with cssselect
- Improved error messaging for failed authentications
- Enhanced compatibility with various proxy configurations
- Fixed edge cases in URL normalization
Configuration Updates
from crawl4ai import BrowserConfig

# Old proxy config (deprecated)
# browser_config = BrowserConfig(proxy="http://proxy:8080")

# New enhanced proxy config
browser_config = BrowserConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "optional-user",
        "password": "optional-pass"
    }
)
Breaking Changes
- Python 3.10+ Required: Upgrade from Python 3.9
- Proxy Parameter Deprecated: Use the new proxy_config structure
- New Dependency: Added cssselect for better CSS handling
Get Started
# Install latest version
pip install crawl4ai==0.7.5
# Docker deployment
docker pull unclecode/crawl4ai:latest
docker run -p 11235:11235 unclecode/crawl4ai:latest
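Once the container is up, you can sanity-check the API before sending crawl jobs; this sketch assumes the server's /health endpoint and the port mapping shown above:

```python
import requests

# Sketch: confirm the Crawl4AI Docker server is reachable
resp = requests.get("http://localhost:11235/health", timeout=5)
print(resp.status_code, resp.text)
```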
Try the Demo:
Resources:
- Documentation: docs.crawl4ai.com
- GitHub: github.com/unclecode/crawl4ai
- Discord: discord.gg/crawl4ai
- Twitter: @unclecode
Happy crawling! 🕷️