Table Extraction Strategies
Overview
New in v0.7.3+: Table extraction now follows the Strategy design pattern, so you can plug in different extraction algorithms for different table structures. Don't worry - your existing code still works! We maintain full backward compatibility while offering new capabilities.
What's Changed?
- Architecture: Table extraction now uses pluggable strategies
- Backward Compatible: Your existing code with `table_score_threshold` continues to work
- More Power: Choose from multiple strategies or create your own
- Same Default Behavior: Uses `DefaultTableExtraction` by default (same as before)
Key Points
- ✅ Old code still works - No breaking changes
- ✅ Same default behavior - Uses the proven extraction algorithm
- ✅ New capabilities - Add LLM extraction or custom strategies when needed
- ✅ Strategy pattern - Clean, extensible architecture
Quick Start
The Simplest Way (Works Like Before)
If you're already using Crawl4AI, nothing changes:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_tables():
    async with AsyncWebCrawler() as crawler:
        # This works exactly like before - uses DefaultTableExtraction internally
        result = await crawler.arun("https://example.com/data")

        # Tables are automatically extracted and available in result.tables
        for table in result.tables:
            print(f"Table with {len(table['rows'])} rows and {len(table['headers'])} columns")
            print(f"Headers: {table['headers']}")
            print(f"First row: {table['rows'][0] if table['rows'] else 'No data'}")

asyncio.run(extract_tables())
Using the Old Configuration (Still Supported)
Your existing code with table_score_threshold continues to work:
# This old approach STILL WORKS - we maintain backward compatibility
config = CrawlerRunConfig(
    table_score_threshold=7  # Internally creates DefaultTableExtraction(table_score_threshold=7)
)
result = await crawler.arun(url, config)
Table Extraction Strategies
Understanding the Strategy Pattern
The strategy pattern allows you to choose different table extraction algorithms at runtime. Think of it as having different tools in a toolbox - you pick the right one for the job:
- No explicit strategy? → Uses `DefaultTableExtraction` automatically (same as v0.7.2 and earlier)
- Need complex table handling? → Choose `LLMTableExtraction` (costs money, use sparingly)
- Want to disable tables? → Use `NoTableExtraction`
- Have special requirements? → Create a custom strategy
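To make the toolbox analogy concrete, here is a minimal sketch of swapping strategies at runtime: the crawl call never changes, only the strategy plugged into `table_extraction` does (the URL is a placeholder):

import asyncio
from crawl4ai import (
    AsyncWebCrawler, CrawlerRunConfig,
    DefaultTableExtraction, NoTableExtraction,
)

async def crawl_with(strategy):
    # The crawl itself is identical for every strategy
    config = CrawlerRunConfig(table_extraction=strategy)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun("https://example.com/data", config)

# Pick the right tool for the job at runtime
result = asyncio.run(crawl_with(DefaultTableExtraction(table_score_threshold=7)))
result = asyncio.run(crawl_with(NoTableExtraction()))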
Available Strategies
| Strategy | Description | Use Case | Cost | When to Use |
|---|---|---|---|---|
| `DefaultTableExtraction` | RECOMMENDED: Same algorithm as before v0.7.3 | General purpose (default) | Free | Use this first - handles 95% of cases |
| `LLMTableExtraction` | AI-powered extraction for complex tables | Tables with complex rowspan/colspan | $$$ per API call | Only when DefaultTableExtraction fails |
| `NoTableExtraction` | Disables table extraction | When tables aren't needed | Free | For text-only extraction |
| Custom strategies | User-defined extraction logic | Specialized requirements | Free | Domain-specific needs |
⚠️ CRITICAL COST WARNING for LLMTableExtraction:
DO NOT USE `LLMTableExtraction` UNLESS ABSOLUTELY NECESSARY!
- Always try `DefaultTableExtraction` first - it's free and handles most tables perfectly
- LLM extraction costs money with every API call
- For large tables (100+ rows), LLM extraction can be very slow
- If you must use LLM on a large table, choose fast providers:
  - ✅ Groq (fastest inference)
  - ✅ Cerebras (optimized for speed)
  - ⚠️ Avoid OpenAI and Anthropic for large tables (slower)
🚧 WORK IN PROGRESS: We are actively developing an advanced non-LLM algorithm that will handle complex table structures (rowspan, colspan, nested tables) for FREE. This will replace the need for costly LLM extraction in most cases. Coming soon!
DefaultTableExtraction
The default strategy uses a sophisticated scoring system to identify data tables:
from crawl4ai import DefaultTableExtraction, CrawlerRunConfig
# Customize the default extraction
table_strategy = DefaultTableExtraction(
    table_score_threshold=7,  # Scoring threshold (default: 7)
    min_rows=2,               # Minimum rows required
    min_cols=2,               # Minimum columns required
    verbose=True              # Enable detailed logging
)

config = CrawlerRunConfig(
    table_extraction=table_strategy
)
Scoring System
The scoring system evaluates multiple factors:
| Factor | Score Impact | Description |
|---|---|---|
| Has `<thead>` | +2 | Semantic table structure |
| Has `<tbody>` | +1 | Organized table body |
| Has `<th>` elements | +2 | Header cells present |
| Headers in correct position | +1 | Proper semantic structure |
| Consistent column count | +2 | Regular data structure |
| Has caption | +2 | Descriptive caption |
| Has summary | +1 | Summary attribute |
| High text density | +2 to +3 | Content-rich cells |
| Data attributes | +0.5 each | Data-* attributes |
| Nested tables | -3 | Often indicates layout |
| Role="presentation" | -3 | Explicitly non-data |
| Too few rows | -2 | Insufficient data |
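To make the factors concrete, here is an illustrative sketch of how such a scorer could be written. This is not the library's actual implementation - just the weights from the table above applied to an lxml table element:

def score_table(table) -> float:
    """Illustrative scorer based on the factors above (not the real library code)."""
    score = 0.0
    if table.xpath(".//thead"):
        score += 2   # semantic table structure
    if table.xpath(".//tbody"):
        score += 1   # organized table body
    if table.xpath(".//th"):
        score += 2   # header cells present
    if table.xpath(".//caption"):
        score += 2   # descriptive caption
    if table.get("summary"):
        score += 1   # summary attribute
    if table.xpath(".//table"):
        score -= 3   # nested tables often indicate layout
    if table.get("role") == "presentation":
        score -= 3   # explicitly non-data
    if len(table.xpath(".//tr")) < 2:
        score -= 2   # too few rows
    # (text density, column consistency, and data-* factors omitted for brevity)
    return score

# A table is kept only if its score reaches table_score_threshold.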
LLMTableExtraction (Use Sparingly!)
⚠️ WARNING: Only use this when DefaultTableExtraction fails with complex tables!
LLMTableExtraction uses AI to understand complex table structures that traditional parsers struggle with. It automatically handles large tables through intelligent chunking and parallel processing:
from crawl4ai import LLMTableExtraction, LLMConfig, CrawlerRunConfig
# Configure LLM (costs money per call!)
llm_config = LLMConfig(
    provider="groq/llama-3.3-70b-versatile",  # Fast provider for large tables
    api_token="your_api_key",
    temperature=0.1
)

# Create LLM extraction strategy with smart chunking
table_strategy = LLMTableExtraction(
    llm_config=llm_config,
    max_tries=3,                 # Retry up to 3 times if extraction fails
    css_selector="table",        # Optional: focus on specific tables
    enable_chunking=True,        # Automatically chunk large tables (default: True)
    chunk_token_threshold=3000,  # Split tables larger than this (default: 3000 tokens)
    min_rows_per_chunk=10,       # Minimum rows per chunk (default: 10)
    max_parallel_chunks=5,       # Process up to 5 chunks in parallel (default: 5)
    verbose=True
)

config = CrawlerRunConfig(
    table_extraction=table_strategy
)
result = await crawler.arun(url, config)
When to Use LLMTableExtraction
✅ Use ONLY when:
- Tables have complex merged cells (rowspan/colspan) that break DefaultTableExtraction
- Nested tables need semantic understanding
- Tables have irregular structures
- You've tried DefaultTableExtraction and it failed

❌ Never use when:
- DefaultTableExtraction works (99% of cases)
- Tables are simple or well-structured
- You're processing many pages (costs add up!)
- Tables have 100+ rows (very slow)
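A practical way to honor these rules is a try-cheap-first fallback: run `DefaultTableExtraction`, and only re-crawl with the LLM strategy if it comes back empty. A minimal sketch - the "no tables means failure" heuristic and the provider choice are assumptions for illustration:

from crawl4ai import (
    AsyncWebCrawler, CrawlerRunConfig,
    DefaultTableExtraction, LLMTableExtraction, LLMConfig,
)

async def extract_with_fallback(url: str):
    async with AsyncWebCrawler() as crawler:
        # Step 1: always try the free strategy first
        config = CrawlerRunConfig(table_extraction=DefaultTableExtraction())
        result = await crawler.arun(url, config)
        if result.tables:  # heuristic: treat "no tables found" as failure
            return result.tables

        # Step 2: only now pay for the LLM strategy
        llm_strategy = LLMTableExtraction(
            llm_config=LLMConfig(provider="groq/llama-3.3-70b-versatile",
                                 api_token="your_api_key")
        )
        config = CrawlerRunConfig(table_extraction=llm_strategy)
        result = await crawler.arun(url, config)
        return result.tables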
How Smart Chunking Works
LLMTableExtraction automatically handles large tables through intelligent chunking:
- Automatic Detection: Tables exceeding the token threshold are automatically split
- Smart Splitting: Chunks are created at row boundaries, preserving table structure
- Header Preservation: Each chunk includes the original headers for context
- Parallel Processing: Multiple chunks are processed simultaneously for speed
- Intelligent Merging: Results are merged back into a single, complete table
Chunking Parameters:
- enable_chunking (default: True): Automatically handle large tables
- chunk_token_threshold (default: 3000): When to split tables
- min_rows_per_chunk (default: 10): Ensures meaningful chunk sizes
- max_parallel_chunks (default: 5): Concurrent processing for speed
The chunking is completely transparent - you get the same output format whether the table was processed in one piece or multiple chunks.
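The mechanics can be pictured with a small sketch: split the rows into chunks that stay under a token budget (never dropping below a minimum row count), carrying the header row into every chunk. This illustrates the idea rather than the library's internal code; the 4-characters-per-token estimate is an assumption:

def chunk_rows(headers, rows, chunk_token_threshold=3000, min_rows_per_chunk=10):
    """Illustrative chunker: each chunk repeats the headers and respects the budget."""
    def est_tokens(cells):
        return sum(len(str(c)) for c in cells) // 4  # rough 4-chars-per-token estimate

    chunks, current, budget = [], [], est_tokens(headers)
    for row in rows:
        current.append(row)
        budget += est_tokens(row)
        if budget >= chunk_token_threshold and len(current) >= min_rows_per_chunk:
            chunks.append({"headers": headers, "rows": current})  # headers preserved per chunk
            current, budget = [], est_tokens(headers)
    if current:
        chunks.append({"headers": headers, "rows": current})
    return chunks

# Each chunk would then go to the LLM (up to max_parallel_chunks at a time)
# and the per-chunk rows would be concatenated back into one table.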
Performance Optimization for LLMTableExtraction
Provider Recommendations by Table Size:
| Table Size | Recommended Providers | Why |
|---|---|---|
| Small (<50 rows) | Any provider | Fast enough |
| Medium (50-200 rows) | Groq, Cerebras | Optimized inference |
| Large (200+ rows) | Groq (best), Cerebras | Fastest inference + automatic chunking |
| Very Large (500+ rows) | Groq with chunking | Parallel processing keeps it fast |
NoTableExtraction
Disable table extraction for better performance when tables aren't needed:
from crawl4ai import NoTableExtraction, CrawlerRunConfig
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)
# Tables won't be extracted, improving performance
result = await crawler.arun(url, config)
assert len(result.tables) == 0
Extracted Table Structure
Each extracted table contains:
{
    "headers": ["Column 1", "Column 2", ...],  # Column headers
    "rows": [                                  # Data rows
        ["Row 1 Col 1", "Row 1 Col 2", ...],
        ["Row 2 Col 1", "Row 2 Col 2", ...],
    ],
    "caption": "Table Caption",                # If present
    "summary": "Table Summary",                # If present
    "metadata": {
        "row_count": 10,                       # Number of rows
        "column_count": 3,                     # Number of columns
        "has_headers": True,                   # Headers detected
        "has_caption": True,                   # Caption exists
        "has_summary": False,                  # Summary exists
        "id": "data-table-1",                  # Table ID if present
        "class": "financial-data"              # Table class if present
    }
}
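Since `caption`, `summary`, `id`, and `class` only appear when present in the HTML, access them defensively with `.get()` to avoid a `KeyError` - a small usage sketch:

def describe_tables(result):
    for table in result.tables:
        meta = table.get("metadata", {})
        caption = table.get("caption") or "(no caption)"
        print(f"{caption}: {meta.get('row_count', 0)} rows x {meta.get('column_count', 0)} cols")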
Configuration Options
Basic Configuration
config = CrawlerRunConfig(
    # Table extraction settings
    table_score_threshold=7,        # Default threshold (backward compatible)
    table_extraction=strategy,      # Optional: custom strategy

    # Filter what to process
    css_selector="main",            # Focus on specific area
    excluded_tags=["nav", "aside"]  # Exclude page sections
)
Advanced Configuration
from crawl4ai import DefaultTableExtraction, CrawlerRunConfig, CacheMode

# Fine-tuned extraction
strategy = DefaultTableExtraction(
    table_score_threshold=5,  # Lower = more permissive
    min_rows=3,               # Require at least 3 rows
    min_cols=2,               # Require at least 2 columns
    verbose=True              # Detailed logging
)

config = CrawlerRunConfig(
    table_extraction=strategy,
    css_selector="article.content",  # Target specific content
    exclude_domains=["ads.com"],     # Exclude ad domains
    cache_mode=CacheMode.BYPASS      # Fresh extraction
)
Working with Extracted Tables
Convert to Pandas DataFrame
import pandas as pd
async def tables_to_dataframes(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url)

        dataframes = []
        for table_data in result.tables:
            # Create DataFrame
            if table_data['headers']:
                df = pd.DataFrame(
                    table_data['rows'],
                    columns=table_data['headers']
                )
            else:
                df = pd.DataFrame(table_data['rows'])

            # Add metadata as DataFrame attributes
            df.attrs['caption'] = table_data.get('caption', '')
            df.attrs['metadata'] = table_data.get('metadata', {})
            dataframes.append(df)

        return dataframes
Filter Tables by Criteria
async def extract_large_tables(url):
    async with AsyncWebCrawler() as crawler:
        # Configure minimum size requirements
        strategy = DefaultTableExtraction(
            min_rows=10,
            min_cols=3,
            table_score_threshold=6
        )
        config = CrawlerRunConfig(
            table_extraction=strategy
        )

        result = await crawler.arun(url, config)

        # Further filter results
        large_tables = [
            table for table in result.tables
            if table['metadata']['row_count'] > 10
            and table['metadata']['column_count'] > 3
        ]
        return large_tables
Export Tables to Different Formats
import json
import csv
async def export_tables(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url)

        for i, table in enumerate(result.tables):
            # Export as JSON
            with open(f'table_{i}.json', 'w') as f:
                json.dump(table, f, indent=2)

            # Export as CSV
            with open(f'table_{i}.csv', 'w', newline='') as f:
                writer = csv.writer(f)
                if table['headers']:
                    writer.writerow(table['headers'])
                writer.writerows(table['rows'])

            # Export as Markdown
            with open(f'table_{i}.md', 'w') as f:
                # Write headers
                if table['headers']:
                    f.write('| ' + ' | '.join(table['headers']) + ' |\n')
                    f.write('|' + '---|' * len(table['headers']) + '\n')
                # Write rows
                for row in table['rows']:
                    f.write('| ' + ' | '.join(str(cell) for cell in row) + ' |\n')
Creating Custom Strategies
Extend TableExtractionStrategy to create custom extraction logic:
Example: Financial Table Extractor
from crawl4ai import TableExtractionStrategy
from typing import List, Dict, Any
import re
class FinancialTableExtractor(TableExtractionStrategy):
    """Extract tables containing financial data."""

    def __init__(self, currency_symbols=None, require_numbers=True, **kwargs):
        super().__init__(**kwargs)
        self.currency_symbols = currency_symbols or ['$', '€', '£', '¥']
        self.require_numbers = require_numbers
        self.number_pattern = re.compile(r'\d+[,.]?\d*')

    def extract_tables(self, element, **kwargs):
        tables_data = []
        for table in element.xpath(".//table"):
            # Check if table contains financial indicators
            table_text = ''.join(table.itertext())

            # Must contain currency symbols
            has_currency = any(sym in table_text for sym in self.currency_symbols)
            if not has_currency:
                continue

            # Must contain numbers if required
            if self.require_numbers:
                numbers = self.number_pattern.findall(table_text)
                if len(numbers) < 3:  # Arbitrary minimum
                    continue

            # Extract the table data
            table_data = self._extract_financial_data(table)
            if table_data:
                tables_data.append(table_data)
        return tables_data

    def _extract_financial_data(self, table):
        """Extract and clean financial data from table."""
        headers = []
        rows = []

        # Extract headers
        for th in table.xpath(".//thead//th | .//tr[1]//th"):
            headers.append(th.text_content().strip())

        # Extract and clean rows
        for tr in table.xpath(".//tbody//tr | .//tr[position()>1]"):
            row = []
            for td in tr.xpath(".//td"):
                text = td.text_content().strip()
                # Clean currency formatting
                text = re.sub(r'[$€£¥,]', '', text)
                row.append(text)
            if row:
                rows.append(row)

        return {
            "headers": headers,
            "rows": rows,
            "caption": self._get_caption(table),
            "summary": table.get("summary", ""),
            "metadata": {
                "type": "financial",
                "row_count": len(rows),
                "column_count": len(headers) or (len(rows[0]) if rows else 0)
            }
        }

    def _get_caption(self, table):
        caption = table.xpath(".//caption/text()")
        return caption[0].strip() if caption else ""

# Usage
strategy = FinancialTableExtractor(
    currency_symbols=['$', 'EUR'],
    require_numbers=True
)

config = CrawlerRunConfig(
    table_extraction=strategy
)
Example: Specific Table Extractor
class SpecificTableExtractor(TableExtractionStrategy):
    """Extract only tables matching specific criteria."""

    def __init__(self,
                 required_headers=None,
                 id_pattern=None,
                 class_pattern=None,
                 **kwargs):
        super().__init__(**kwargs)
        self.required_headers = required_headers or []
        self.id_pattern = id_pattern
        self.class_pattern = class_pattern

    def extract_tables(self, element, **kwargs):
        tables_data = []
        for table in element.xpath(".//table"):
            # Check ID pattern
            if self.id_pattern:
                table_id = table.get('id', '')
                if not re.match(self.id_pattern, table_id):
                    continue

            # Check class pattern
            if self.class_pattern:
                table_class = table.get('class', '')
                if not re.match(self.class_pattern, table_class):
                    continue

            # Extract headers to check requirements
            headers = self._extract_headers(table)

            # Check if required headers are present
            if self.required_headers:
                if not all(req in headers for req in self.required_headers):
                    continue

            # Extract full table data
            table_data = self._extract_table_data(table, headers)
            tables_data.append(table_data)
        return tables_data
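The example above leaves `_extract_headers` and `_extract_table_data` undefined. One plausible implementation, reusing the same XPath approach as the financial extractor, could look like this - these methods belong inside the class body, and they are a sketch rather than part of the original example:

    def _extract_headers(self, table):
        # Header cells from <thead> or the first row
        return [th.text_content().strip()
                for th in table.xpath(".//thead//th | .//tr[1]//th")]

    def _extract_table_data(self, table, headers):
        rows = [[td.text_content().strip() for td in tr.xpath(".//td")]
                for tr in table.xpath(".//tbody//tr | .//tr[position()>1]")]
        rows = [row for row in rows if row]  # drop empty rows
        return {
            "headers": headers,
            "rows": rows,
            "caption": "",
            "summary": table.get("summary", ""),
            "metadata": {"row_count": len(rows),
                         "column_count": len(headers) or (len(rows[0]) if rows else 0)}
        }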
Combining with Other Strategies
Table extraction works seamlessly with other Crawl4AI strategies:
import json
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    DefaultTableExtraction,
    JsonCssExtractionStrategy
)

async def combined_extraction(url):
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            # Table extraction
            table_extraction=DefaultTableExtraction(
                table_score_threshold=6,
                min_rows=2
            ),

            # CSS-based extraction for specific elements
            extraction_strategy=JsonCssExtractionStrategy({
                "title": "h1",
                "summary": "p.summary",
                "date": "time"
            }),

            # Focus on main content
            css_selector="main.content"
        )

        result = await crawler.arun(url, config)

        # Access different extraction results
        tables = result.tables                             # Table data
        structured = json.loads(result.extracted_content)  # CSS extraction

        return {
            "tables": tables,
            "structured_data": structured,
            "markdown": result.markdown
        }
Performance Considerations
Optimization Tips
- Disable when not needed: Use `NoTableExtraction` if tables aren't required
- Target specific areas: Use `css_selector` to limit processing scope
- Set minimum thresholds: Filter out small/irrelevant tables early
- Cache results: Use appropriate cache modes for repeated extractions
# Optimized configuration for large pages
config = CrawlerRunConfig(
    # Only process main content area
    css_selector="article.main-content",

    # Exclude navigation and sidebars
    excluded_tags=["nav", "aside", "footer"],

    # Higher threshold for stricter filtering
    table_extraction=DefaultTableExtraction(
        table_score_threshold=8,
        min_rows=5,
        min_cols=3
    ),

    # Enable caching for repeated access
    cache_mode=CacheMode.ENABLED
)
Migration Guide
Important: Your Code Still Works!
No changes required! The transition to the strategy pattern is fully backward compatible.
How It Works Internally
v0.7.2 and Earlier
# Old way - directly passing table_score_threshold
config = CrawlerRunConfig(
    table_score_threshold=7
)
# Internally: No strategy pattern, direct implementation
v0.7.3+ (Current)
# Old way STILL WORKS - we handle it internally
config = CrawlerRunConfig(
    table_score_threshold=7
)
# Internally: Automatically creates DefaultTableExtraction(table_score_threshold=7)
Taking Advantage of New Features
While your old code works, you can now use the strategy pattern for more control:
# Option 1: Keep using the old way (perfectly fine!)
config = CrawlerRunConfig(
    table_score_threshold=7  # Still supported
)

# Option 2: Use the new strategy pattern (more flexibility)
from crawl4ai import DefaultTableExtraction

strategy = DefaultTableExtraction(
    table_score_threshold=7,
    min_rows=2,  # New capability!
    min_cols=2   # New capability!
)
config = CrawlerRunConfig(
    table_extraction=strategy
)

# Option 3: Use advanced strategies when needed
from crawl4ai import LLMTableExtraction, LLMConfig

# Only for complex tables that DefaultTableExtraction can't handle
# Automatically handles large tables with smart chunking
llm_strategy = LLMTableExtraction(
    llm_config=LLMConfig(
        provider="groq/llama-3.3-70b-versatile",
        api_token="your_key"
    ),
    max_tries=3,
    enable_chunking=True,        # Automatically chunk large tables
    chunk_token_threshold=3000,  # Chunk when exceeding 3000 tokens
    max_parallel_chunks=5        # Process up to 5 chunks in parallel
)
config = CrawlerRunConfig(
    table_extraction=llm_strategy  # Advanced extraction with automatic chunking
)
Summary
- ✅ No breaking changes - Old code works as-is
- ✅ Same defaults - `DefaultTableExtraction` is automatically used
- ✅ Gradual adoption - Use new features when you need them
- ✅ Full compatibility - `result.tables` structure unchanged
Best Practices
1. Choose the Right Strategy (Cost-Conscious Approach)
Decision Flow:

1. Do you need tables?
   → No: Use NoTableExtraction
   → Yes: Continue to #2

2. Try DefaultTableExtraction first (FREE)
   → Works? Done! ✅
   → Fails? Continue to #3

3. Is the table critical and complex?
   → No: Accept DefaultTableExtraction results
   → Yes: Continue to #4

4. Use LLMTableExtraction (COSTS MONEY)
   → Small table (<50 rows): Any LLM provider
   → Large table (50+ rows): Use Groq or Cerebras
   → Very large (500+ rows): Reconsider - maybe chunk the page
Strategy Selection Guide:
- DefaultTableExtraction: Use for 99% of cases - it's free and effective
- LLMTableExtraction: Only for complex tables with merged cells that break DefaultTableExtraction
- NoTableExtraction: When you only need text/markdown content
- Custom Strategy: For specialized requirements (financial, scientific, etc.)
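For teams that want the flow encoded rather than remembered, a small helper can pick the strategy. This is a hedged sketch; the boolean inputs are assumptions you would derive from your own pipeline:

from crawl4ai import DefaultTableExtraction, LLMTableExtraction, NoTableExtraction, LLMConfig

def choose_table_strategy(need_tables: bool, default_failed: bool = False,
                          table_is_critical: bool = False):
    """Encodes the cost-conscious decision flow above (illustrative only)."""
    if not need_tables:
        return NoTableExtraction()       # Step 1: skip tables entirely
    if not default_failed:
        return DefaultTableExtraction()  # Step 2: always try free first
    if not table_is_critical:
        return DefaultTableExtraction()  # Step 3: accept imperfect results
    return LLMTableExtraction(           # Step 4: last resort, costs money
        llm_config=LLMConfig(provider="groq/llama-3.3-70b-versatile",
                             api_token="your_api_key")
    )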
2. Validate Extracted Data
def validate_table(table):
    """Validate table data quality."""
    # Check structure
    if not table.get('rows'):
        return False

    # Check consistency
    if table.get('headers'):
        expected_cols = len(table['headers'])
        for row in table['rows']:
            if len(row) != expected_cols:
                return False

    # Check minimum content
    total_cells = sum(len(row) for row in table['rows'])
    if total_cells == 0:
        return False
    non_empty = sum(1 for row in table['rows']
                    for cell in row if str(cell).strip())
    if non_empty / total_cells < 0.5:  # Less than 50% non-empty
        return False

    return True

# Filter valid tables
valid_tables = [t for t in result.tables if validate_table(t)]
3. Handle Edge Cases
async def robust_table_extraction(url):
    """Extract tables with error handling."""
    async with AsyncWebCrawler() as crawler:
        try:
            config = CrawlerRunConfig(
                table_extraction=DefaultTableExtraction(
                    table_score_threshold=6,
                    verbose=True
                )
            )
            result = await crawler.arun(url, config)

            if not result.success:
                print(f"Crawl failed: {result.error_message}")
                return []

            # Process tables safely
            processed_tables = []
            for table in result.tables:
                try:
                    # Validate and process
                    if validate_table(table):
                        processed_tables.append(table)
                except Exception as e:
                    print(f"Error processing table: {e}")
                    continue

            return processed_tables
        except Exception as e:
            print(f"Extraction error: {e}")
            return []
Troubleshooting
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| No tables extracted | Score too high | Lower table_score_threshold |
| Layout tables included | Score too low | Increase table_score_threshold |
| Missing tables | CSS selector too specific | Broaden or remove css_selector |
| Incomplete data | Complex table structure | Create custom strategy |
| Performance issues | Processing entire page | Use css_selector to limit scope |
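When tuning the threshold, a simple loop that retries with progressively lower values can locate the right setting quickly. An illustrative sketch - the threshold range is an assumption:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultTableExtraction

async def find_working_threshold(url, thresholds=(9, 7, 5, 3)):
    """Try progressively more permissive thresholds until tables appear."""
    async with AsyncWebCrawler() as crawler:
        for threshold in thresholds:
            config = CrawlerRunConfig(
                table_extraction=DefaultTableExtraction(table_score_threshold=threshold)
            )
            result = await crawler.arun(url, config)
            if result.tables:
                print(f"Found {len(result.tables)} table(s) at threshold {threshold}")
                return result.tables, threshold
    return [], None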
Debug Logging
Enable verbose logging to understand extraction decisions:
import logging
# Configure logging
logging.basicConfig(level=logging.DEBUG)
# Enable verbose mode in strategy
strategy = DefaultTableExtraction(
    table_score_threshold=7,
    verbose=True  # Detailed extraction logs
)

config = CrawlerRunConfig(
    table_extraction=strategy,
    verbose=True  # General crawler logs
)
See Also
- Extraction Strategies - Overview of all extraction strategies
- Content Selection - Using CSS selectors and filters
- Performance Optimization - Speed up extraction
- Examples - Complete working examples