# Detailed Outline for crawl4ai - config_objects Component

**Target Document Type:** memory
**Target Output Filename Suggestion:** `llm_memory_config_objects.md`
**Library Version Context:** 0.6.3
**Outline Generation Date:** 2024-05-24

---

## 1. Introduction to Configuration Objects in Crawl4ai

* **1.1. Purpose of Configuration Objects**
    * Explanation: Configuration objects in `crawl4ai` centralize and manage settings for the library's components and behaviors, including browser setup, individual crawl run parameters, LLM provider interactions, and proxy settings.
    * Benefit: This approach improves readability by grouping related settings, eases maintenance by giving configurations a clear structure, and lets users tailor the library's behavior to their specific needs.
* **1.2. General Principles and Usage**
    * **1.2.1. Immutability/Cloning:**
        * Concept: Most configuration objects provide a `clone()` method that creates a modified copy without altering the original instance. This promotes safer state management, especially when reusing a base configuration across multiple tasks.
        * Method: `clone(**kwargs)` on most configuration objects.
    * **1.2.2. Serialization and Deserialization:**
        * Concept: `crawl4ai` configuration objects can be serialized to a dictionary (e.g., for saving to JSON) and deserialized back into their respective class instances.
        * Methods:
            * `dump() -> dict`: Serializes the object to a dictionary suitable for JSON, using the internal `to_serializable_dict` helper.
            * `load(data: dict) -> ConfigClass` (Static Method): Deserializes an object from a dictionary, using the internal `from_serializable_dict` helper.
            * `to_dict() -> dict`: Converts the object to a standard Python dictionary.
            * `from_dict(data: dict) -> ConfigClass` (Static Method): Creates an instance from a standard Python dictionary.
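The `clone()` and `dump()`/`load()` conventions above can be sketched with a minimal stand-in class. Note this is a hypothetical `DemoConfig` illustrating the pattern, not a class from `crawl4ai`:

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical config class illustrating the clone()/dump()/load() pattern."""
    headless: bool = True
    viewport_width: int = 1080

    def clone(self, **kwargs) -> "DemoConfig":
        # Copy with selected fields overridden; the original stays untouched.
        return replace(self, **kwargs)

    def dump(self) -> dict:
        # Serialize to a JSON-ready dictionary.
        return asdict(self)

    @staticmethod
    def load(data: dict) -> "DemoConfig":
        # Reconstruct an instance from a dumped dictionary.
        return DemoConfig(**data)


base = DemoConfig()
debug = base.clone(headless=False)        # override one field for a debug run
restored = DemoConfig.load(debug.dump())  # dict round-trip
```

The real config classes follow the same shape, with `to_serializable_dict`/`from_serializable_dict` additionally handling enums and nested strategy objects.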
        * Helper Functions:
            * `crawl4ai.async_configs.to_serializable_dict(obj: Any, ignore_default_value: bool = False) -> Dict`: Recursively converts objects into a serializable dictionary format, handling complex types such as enums and nested objects.
            * `crawl4ai.async_configs.from_serializable_dict(data: Any) -> Any`: Reconstructs Python objects from the serializable dictionary format.
* **1.3. Scope of this Document**
    * Statement: This document provides a factual API reference for the primary configuration objects in the `crawl4ai` library, detailing their purpose, initialization parameters, attributes, and key methods.

## 2. Core Configuration Objects

### 2.1. `BrowserConfig`

Located in `crawl4ai.async_configs`.

* **2.1.1. Purpose:**
    * Description: The `BrowserConfig` class configures a browser instance and its associated contexts for browser-based crawler strategies such as `AsyncPlaywrightCrawlerStrategy`. It centralizes all parameters that affect the creation and behavior of the browser.
* **2.1.2. Initialization (`__init__`)**
    * Signature:
        ```python
        class BrowserConfig:
            def __init__(
                self,
                browser_type: str = "chromium",
                headless: bool = True,
                browser_mode: str = "dedicated",
                use_managed_browser: bool = False,
                cdp_url: Optional[str] = None,
                use_persistent_context: bool = False,
                user_data_dir: Optional[str] = None,
                chrome_channel: Optional[str] = "chromium",  # Note: 'channel' is preferred
                channel: Optional[str] = "chromium",
                proxy: Optional[str] = None,
                proxy_config: Optional[Union[ProxyConfig, dict]] = None,
                viewport_width: int = 1080,
                viewport_height: int = 600,
                viewport: Optional[dict] = None,
                accept_downloads: bool = False,
                downloads_path: Optional[str] = None,
                storage_state: Optional[Union[str, dict]] = None,
                ignore_https_errors: bool = True,
                java_script_enabled: bool = True,
                sleep_on_close: bool = False,
                verbose: bool = True,
                cookies: Optional[List[dict]] = None,
                headers: Optional[dict] = None,
                user_agent: Optional[str] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
                user_agent_mode: Optional[str] = "",
                user_agent_generator_config: Optional[dict] = None,  # Default is {} in __init__
                text_mode: bool = False,
                light_mode: bool = False,
                extra_args: Optional[List[str]] = None,
                debugging_port: int = 9222,
                host: str = "localhost",
            ): ...
        ```
    * Parameters:
        * `browser_type (str, default: "chromium")`: Specifies the browser engine to use. Supported values: `"chromium"`, `"firefox"`, `"webkit"`.
        * `headless (bool, default: True)`: If `True`, runs the browser without a visible GUI. Set to `False` for debugging or visual interaction.
        * `browser_mode (str, default: "dedicated")`: Defines how the browser is initialized. Options: `"builtin"` (uses built-in CDP), `"dedicated"` (new instance each time), `"cdp"` (connects to an existing CDP endpoint specified by `cdp_url`), `"docker"` (runs the browser in a Docker container).
        * `use_managed_browser (bool, default: False)`: If `True`, launches the browser via a managed approach (e.g., CDP or Docker), allowing more advanced control. Automatically set to `True` if `browser_mode` is `"builtin"` or `"docker"`, if `cdp_url` is provided, or if `use_persistent_context` is `True`.
        * `cdp_url (Optional[str], default: None)`: The URL of the Chrome DevTools Protocol (CDP) endpoint. If not provided while `use_managed_browser` is active, it may be set by an internal browser manager.
        * `use_persistent_context (bool, default: False)`: If `True`, uses a persistent browser context (profile), saving cookies, localStorage, etc., across sessions. Requires `user_data_dir`. Sets `use_managed_browser=True`.
        * `user_data_dir (Optional[str], default: None)`: Path to a directory for storing user data for persistent sessions. If `None` while `use_persistent_context` is `True`, a temporary directory may be used.
        * `chrome_channel (Optional[str], default: "chromium")`: Specifies the Chrome channel (e.g., "chrome", "msedge", "chromium-beta"). Only applicable when `browser_type` is `"chromium"`.
        * `channel (Optional[str], default: "chromium")`: Preferred alias for `chrome_channel`. Set to `""` for Firefox or WebKit.
        * `proxy (Optional[str], default: None)`: A proxy server URL (e.g., "http://username:password@proxy.example.com:8080").
        * `proxy_config (Optional[Union[ProxyConfig, dict]], default: None)`: A `ProxyConfig` object or dictionary with detailed proxy settings. Overrides the `proxy` string if both are provided.
        * `viewport_width (int, default: 1080)`: Default width of the browser viewport in pixels.
        * `viewport_height (int, default: 600)`: Default height of the browser viewport in pixels.
        * `viewport (Optional[dict], default: None)`: A dictionary specifying viewport dimensions, e.g., `{"width": 1920, "height": 1080}`. If set, overrides `viewport_width` and `viewport_height`.
        * `accept_downloads (bool, default: False)`: If `True`, allows the browser to download files.
        * `downloads_path (Optional[str], default: None)`: Directory where downloaded files are stored. Required if `accept_downloads` is `True`.
        * `storage_state (Optional[Union[str, dict]], default: None)`: Path to a JSON file, or a dictionary, containing the browser's storage state (cookies, localStorage, etc.) to load.
        * `ignore_https_errors (bool, default: True)`: If `True`, HTTPS certificate errors are ignored.
        * `java_script_enabled (bool, default: True)`: If `True`, JavaScript execution is enabled on web pages.
        * `sleep_on_close (bool, default: False)`: If `True`, introduces a small delay before the browser is closed.
        * `verbose (bool, default: True)`: If `True`, enables verbose logging for browser operations.
        * `cookies (Optional[List[dict]], default: None)`: A list of cookie dictionaries to set in the browser context. Each dictionary should conform to Playwright's cookie format.
        * `headers (Optional[dict], default: None)`: Additional HTTP headers to send with every request made by the browser.
        * `user_agent (Optional[str], default: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36")`: The User-Agent string the browser will use.
        * `user_agent_mode (Optional[str], default: "")`: Mode for generating the User-Agent string. If set (e.g., to "random"), `user_agent_generator_config` can be used.
        * `user_agent_generator_config (Optional[dict], default: {})`: Configuration dictionary for the User-Agent generator when `user_agent_mode` is active.
        * `text_mode (bool, default: False)`: If `True`, attempts to disable images and other rich content to speed up loading for text-focused crawls.
        * `light_mode (bool, default: False)`: If `True`, disables certain background browser features for potential performance gains.
        * `extra_args (Optional[List[str]], default: None)`: A list of additional command-line arguments to pass to the browser executable at launch.
        * `debugging_port (int, default: 9222)`: The port to use for the browser's remote debugging protocol (CDP).
        * `host (str, default: "localhost")`: The host on which the browser's remote debugging protocol listens.
* **2.1.3. Key Public Attributes/Properties:**
    * All parameters listed in `__init__` are available as public attributes with the same names and types.
    * `browser_hint (str)`: [Read-only] A string of client hints (Sec-CH-UA) generated from the `user_agent` string; set automatically during initialization.
* **2.1.4. Key Public Methods:**
    * `from_kwargs(cls, kwargs: dict) -> BrowserConfig` (Static Method):
        * Purpose: Creates a `BrowserConfig` instance from a dictionary of keyword arguments.
    * `to_dict(self) -> dict`:
        * Purpose: Converts the `BrowserConfig` instance into a dictionary representation.
    * `clone(self, **kwargs) -> BrowserConfig`:
        * Purpose: Creates a deep copy of the current instance; keyword arguments override specific attributes in the new instance.
    * `dump(self) -> dict`:
        * Purpose: Serializes the object into a dictionary suitable for JSON storage or transmission, via the `to_serializable_dict` helper.
    * `load(cls, data: dict) -> BrowserConfig` (Static Method):
        * Purpose: Deserializes a `BrowserConfig` from a dictionary (typically one created by `dump()`), via the `from_serializable_dict` helper.

### 2.2. `CrawlerRunConfig`

Located in `crawl4ai.async_configs`.

* **2.2.1. Purpose:**
    * Description: The `CrawlerRunConfig` class encapsulates all settings that control a single crawl operation performed by `AsyncWebCrawler.arun()` or multiple operations within `AsyncWebCrawler.arun_many()`.
      This includes parameters for content processing, page interaction, caching, and media handling.
* **2.2.2. Initialization (`__init__`)**
    * Signature:
        ```python
        class CrawlerRunConfig:
            def __init__(
                self,
                url: Optional[str] = None,
                word_count_threshold: int = MIN_WORD_THRESHOLD,
                extraction_strategy: Optional[ExtractionStrategy] = None,
                chunking_strategy: Optional[ChunkingStrategy] = RegexChunking(),
                markdown_generator: Optional[MarkdownGenerationStrategy] = DefaultMarkdownGenerator(),
                only_text: bool = False,
                css_selector: Optional[str] = None,
                target_elements: Optional[List[str]] = None,  # Default is [] in __init__
                excluded_tags: Optional[List[str]] = None,  # Default is [] in __init__
                excluded_selector: Optional[str] = "",
                keep_data_attributes: bool = False,
                keep_attrs: Optional[List[str]] = None,  # Default is [] in __init__
                remove_forms: bool = False,
                prettify: bool = False,
                parser_type: str = "lxml",
                scraping_strategy: Optional[ContentScrapingStrategy] = None,  # Instantiated as WebScrapingStrategy() if None
                proxy_config: Optional[Union[ProxyConfig, dict]] = None,
                proxy_rotation_strategy: Optional[ProxyRotationStrategy] = None,
                locale: Optional[str] = None,
                timezone_id: Optional[str] = None,
                geolocation: Optional[GeolocationConfig] = None,
                fetch_ssl_certificate: bool = False,
                cache_mode: CacheMode = CacheMode.BYPASS,
                session_id: Optional[str] = None,
                shared_data: Optional[dict] = None,
                wait_until: str = "domcontentloaded",
                page_timeout: int = PAGE_TIMEOUT,
                wait_for: Optional[str] = None,
                wait_for_timeout: Optional[int] = None,
                wait_for_images: bool = False,
                delay_before_return_html: float = 0.1,
                mean_delay: float = 0.1,
                max_range: float = 0.3,
                semaphore_count: int = 5,
                js_code: Optional[Union[str, List[str]]] = None,
                js_only: bool = False,
                ignore_body_visibility: bool = True,
                scan_full_page: bool = False,
                scroll_delay: float = 0.2,
                process_iframes: bool = False,
                remove_overlay_elements: bool = False,
                simulate_user: bool = False,
                override_navigator: bool = False,
                magic: bool = False,
                adjust_viewport_to_content: bool = False,
                screenshot: bool = False,
                screenshot_wait_for: Optional[float] = None,
                screenshot_height_threshold: int = SCREENSHOT_HEIGHT_THRESHOLD,
                pdf: bool = False,
                capture_mhtml: bool = False,
                image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
                image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
                table_score_threshold: int = 7,
                exclude_external_images: bool = False,
                exclude_all_images: bool = False,
                exclude_social_media_domains: Optional[List[str]] = None,  # Uses SOCIAL_MEDIA_DOMAINS if None
                exclude_external_links: bool = False,
                exclude_social_media_links: bool = False,
                exclude_domains: Optional[List[str]] = None,  # Default is [] in __init__
                exclude_internal_links: bool = False,
                verbose: bool = True,
                log_console: bool = False,
                capture_network_requests: bool = False,
                capture_console_messages: bool = False,
                method: str = "GET",
                stream: bool = False,
                check_robots_txt: bool = False,
                user_agent: Optional[str] = None,
                user_agent_mode: Optional[str] = None,
                user_agent_generator_config: Optional[dict] = None,  # Default is {} in __init__
                deep_crawl_strategy: Optional[DeepCrawlStrategy] = None,
                experimental: Optional[Dict[str, Any]] = None,  # Default is {} in __init__
            ): ...
        ```
    * Parameters:
        * `url (Optional[str], default: None)`: The target URL for this specific crawl run.
        * `word_count_threshold (int, default: MIN_WORD_THRESHOLD)`: Minimum word count for a text block to be considered significant during content processing.
        * `extraction_strategy (Optional[ExtractionStrategy], default: None)`: Strategy for extracting structured data from the page. If `None`, `NoExtractionStrategy` is used.
        * `chunking_strategy (Optional[ChunkingStrategy], default: RegexChunking())`: Strategy to split content into chunks before extraction.
        * `markdown_generator (Optional[MarkdownGenerationStrategy], default: DefaultMarkdownGenerator())`: Strategy for converting HTML to Markdown.
        * `only_text (bool, default: False)`: If `True`, attempts to extract only textual content, potentially ignoring structural elements beneficial for rich Markdown.
        * `css_selector (Optional[str], default: None)`: A CSS selector defining the primary region of the page to focus on for content extraction. The raw HTML is reduced to this region.
        * `target_elements (Optional[List[str]], default: [])`: A list of CSS selectors. If provided, only the content within these elements is considered for Markdown generation and structured data extraction. Unlike `css_selector`, this does not reduce the raw HTML but scopes the processing.
        * `excluded_tags (Optional[List[str]], default: [])`: A list of HTML tag names (e.g., "nav", "footer") to remove from the HTML before processing.
        * `excluded_selector (Optional[str], default: "")`: A CSS selector specifying elements to remove from the HTML before processing.
        * `keep_data_attributes (bool, default: False)`: If `True`, `data-*` attributes on HTML elements are preserved during cleaning.
        * `keep_attrs (Optional[List[str]], default: [])`: A list of specific HTML attribute names to preserve during HTML cleaning.
        * `remove_forms (bool, default: False)`: If `True`, all `