Migration Guide: Table Extraction v0.7.3

Overview

Version 0.7.3 introduces the Table Extraction Strategy Pattern, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.

What's New

Strategy Pattern Implementation

Table extraction now follows the same strategy pattern used throughout Crawl4AI:

  • Consistent Architecture: Aligns with extraction, chunking, and markdown strategies
  • Extensibility: Easy to create custom table extraction strategies
  • Better Separation: Table logic moved from content scraping to dedicated module
  • Full Control: Fine-grained control over table detection and extraction

New Classes

from crawl4ai import (
    TableExtractionStrategy,    # Abstract base class
    DefaultTableExtraction,      # Current implementation (default)
    NoTableExtraction           # Explicitly disable extraction
)

Backward Compatibility

✅ All existing code continues to work without changes.

No Changes Required

If your code looks like this, it will continue to work:

# This still works exactly the same
config = CrawlerRunConfig(
    table_score_threshold=7
)
result = await crawler.arun(url, config)
tables = result.tables  # Same structure, same data

What Happens Behind the Scenes

When you don't specify a table_extraction strategy:

  1. CrawlerRunConfig automatically creates DefaultTableExtraction
  2. It uses your table_score_threshold parameter
  3. Tables are extracted exactly as before
  4. Results appear in result.tables with the same structure
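The fallback described above can be sketched in plain Python. This is a simplified stand-in for the real CrawlerRunConfig internals, not the actual crawl4ai source; the stub classes below only illustrate how the legacy parameter feeds the default strategy:

```python
# Simplified sketch (NOT the actual crawl4ai source) of how a missing
# table_extraction strategy falls back to DefaultTableExtraction.

class DefaultTableExtraction:
    def __init__(self, table_score_threshold=7, min_rows=0, min_cols=0, verbose=False):
        self.table_score_threshold = table_score_threshold
        self.min_rows = min_rows
        self.min_cols = min_cols
        self.verbose = verbose

class CrawlerRunConfig:
    def __init__(self, table_score_threshold=7, table_extraction=None):
        # When no strategy is given, build the default from the
        # legacy table_score_threshold parameter.
        self.table_extraction = table_extraction or DefaultTableExtraction(
            table_score_threshold=table_score_threshold
        )

config = CrawlerRunConfig(table_score_threshold=5)
print(type(config.table_extraction).__name__)         # DefaultTableExtraction
print(config.table_extraction.table_score_threshold)  # 5
```

This is why passing `table_score_threshold` alone and passing an explicit `DefaultTableExtraction` with the same threshold produce identical results.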

New Capabilities

1. Explicit Strategy Configuration

You can now explicitly configure table extraction:

# New: Explicit control
strategy = DefaultTableExtraction(
    table_score_threshold=7,
    min_rows=2,              # New: minimum row filter
    min_cols=2,              # New: minimum column filter
    verbose=True             # New: detailed logging
)

config = CrawlerRunConfig(
    table_extraction=strategy
)

2. Disable Table Extraction

Improve performance when tables aren't needed:

# New: Skip table extraction entirely
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)
# No CPU cycles spent on table detection/extraction

3. Custom Extraction Strategies

Create specialized extractors:

class MyTableExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):
        # Custom extraction logic: return a list of table dicts
        # (each with 'headers' and 'rows' keys)
        return custom_tables

config = CrawlerRunConfig(
    table_extraction=MyTableExtractor()
)

Migration Scenarios

Scenario 1: Basic Usage (No Changes Needed)

Before (v0.7.2):

config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])

After (v0.7.3):

# Exactly the same - no changes required
config = CrawlerRunConfig()
result = await crawler.arun(url, config)
for table in result.tables:
    print(table['headers'])
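For reference, each entry in result.tables is assumed throughout this guide to be a dict with at least 'headers' and 'rows' keys. A minimal illustration of that shape (example values are made up; real tables may carry additional keys):

```python
# Illustrative shape of one entry in result.tables, as used in this
# guide ('headers' and 'rows'); other keys may exist in practice.
table = {
    "headers": ["Name", "Price"],
    "rows": [
        ["Widget", "9.99"],
        ["Gadget", "19.99"],
    ],
}
print(table["headers"])    # ['Name', 'Price']
print(len(table["rows"]))  # 2
```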

Scenario 2: Custom Threshold (No Changes Needed)

Before (v0.7.2):

config = CrawlerRunConfig(
    table_score_threshold=5
)

After (v0.7.3):

# Still works the same
config = CrawlerRunConfig(
    table_score_threshold=5
)

# Or use new explicit approach for more control
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=2  # Additional filtering
)
config = CrawlerRunConfig(
    table_extraction=strategy
)

Scenario 3: Advanced Filtering (New Feature)

Before (v0.7.2):

# Had to filter after extraction
config = CrawlerRunConfig(
    table_score_threshold=5
)
result = await crawler.arun(url, config)

# Manual filtering
large_tables = [
    t for t in result.tables 
    if len(t['rows']) >= 5 and len(t['headers']) >= 3
]

After (v0.7.3):

# Filter during extraction (more efficient)
strategy = DefaultTableExtraction(
    table_score_threshold=5,
    min_rows=5,
    min_cols=3
)
config = CrawlerRunConfig(
    table_extraction=strategy
)
result = await crawler.arun(url, config)
# result.tables already filtered

Code Organization Changes

Module Structure

Before (v0.7.2):

crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      - is_data_table()      # Table detection
      - extract_table_data() # Table extraction

After (v0.7.3):

crawl4ai/
  content_scraping_strategy.py
    - LXMLWebScrapingStrategy
      # Table methods removed, uses strategy

  table_extraction.py (NEW)
    - TableExtractionStrategy    # Base class
    - DefaultTableExtraction      # Moved logic here
    - NoTableExtraction          # New option

Import Changes

New imports available (optional):

# These are now available but not required for existing code
from crawl4ai import (
    TableExtractionStrategy,
    DefaultTableExtraction,
    NoTableExtraction
)

Performance Implications

No Performance Impact

For existing code, performance remains identical:

  • Same extraction logic
  • Same scoring algorithm
  • Same processing time

Performance Improvements Available

New options for better performance:

# Skip tables entirely (faster)
config = CrawlerRunConfig(
    table_extraction=NoTableExtraction()
)

# Process only specific areas (faster)
config = CrawlerRunConfig(
    css_selector="main.content",
    table_extraction=DefaultTableExtraction(
        min_rows=5,  # Skip small tables
        min_cols=3
    )
)

Testing Your Migration

Verification Script

Run this to verify your extraction still works:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def verify_extraction():
    url = "your_url_here"

    async with AsyncWebCrawler() as crawler:
        # Test 1: Old approach
        config_old = CrawlerRunConfig(
            table_score_threshold=7
        )
        result_old = await crawler.arun(url, config_old)

        # Test 2: New explicit approach
        from crawl4ai import DefaultTableExtraction
        config_new = CrawlerRunConfig(
            table_extraction=DefaultTableExtraction(
                table_score_threshold=7
            )
        )
        result_new = await crawler.arun(url, config_new)

        # Compare results
        assert len(result_old.tables) == len(result_new.tables)
        print(f"✓ Both approaches extracted {len(result_old.tables)} tables")

        # Verify structure
        for old, new in zip(result_old.tables, result_new.tables):
            assert old['headers'] == new['headers']
            assert old['rows'] == new['rows']

        print("✓ Table content identical")

asyncio.run(verify_extraction())

Deprecation Notes

No Deprecations

  • All existing parameters continue to work
  • table_score_threshold in CrawlerRunConfig is still supported
  • No breaking changes

Internal Changes (Transparent to Users)

  • LXMLWebScrapingStrategy.is_data_table() - Moved to DefaultTableExtraction
  • LXMLWebScrapingStrategy.extract_table_data() - Moved to DefaultTableExtraction

These methods were internal and not part of the public API.

Benefits of Upgrading

While not required, using the new pattern provides:

  1. Better Control: Filter tables during extraction, not after
  2. Performance Options: Skip extraction when not needed
  3. Extensibility: Create custom extractors for specific needs
  4. Consistency: Same pattern as other Crawl4AI strategies
  5. Future-Proof: Ready for upcoming advanced strategies

Troubleshooting

Issue: Different Number of Tables

Cause: Threshold or filtering differences

Solution:

# Ensure same threshold
strategy = DefaultTableExtraction(
    table_score_threshold=7,  # Match your old setting
    min_rows=0,               # No filtering (default)
    min_cols=0                # No filtering (default)
)

Issue: Import Errors

Cause: Using new classes without importing

Solution:

# Add imports if using new features
from crawl4ai import (
    DefaultTableExtraction,
    NoTableExtraction,
    TableExtractionStrategy
)

Issue: Custom Strategy Not Working

Cause: Incorrect method signature

Solution:

class CustomExtractor(TableExtractionStrategy):
    def extract_tables(self, element, **kwargs):  # Correct signature
        # Not: extract_tables(self, html)
        # Not: extract(self, element)
        return tables_list
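For a runnable version of the correct override shape, the sketch below uses a stub base class so it is self-contained (in real code you would import TableExtractionStrategy from crawl4ai, and element would be a parsed DOM node; the candidate_tables kwarg here is purely a demo input, not a real crawl4ai parameter):

```python
class TableExtractionStrategy:  # stub standing in for crawl4ai's base class
    def extract_tables(self, element, **kwargs):
        raise NotImplementedError

class LargeTableExtractor(TableExtractionStrategy):
    """Hypothetical extractor: keeps only tables with enough data rows."""
    def __init__(self, min_rows=3):
        self.min_rows = min_rows

    def extract_tables(self, element, **kwargs):  # correct signature
        # Demo only: real code would walk `element`; here we take
        # pre-parsed tables from a made-up kwarg for illustration.
        tables = kwargs.get("candidate_tables", [])
        return [t for t in tables if len(t["rows"]) >= self.min_rows]

extractor = LargeTableExtractor(min_rows=3)
kept = extractor.extract_tables(
    element=None,
    candidate_tables=[
        {"headers": ["x"], "rows": [[1], [2], [3]]},
        {"headers": ["y"], "rows": [[1]]},
    ],
)
print(len(kept))  # 1
```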

Getting Help

If you encounter issues:

  1. Check your table_score_threshold matches previous settings
  2. Verify imports if using new classes
  3. Enable verbose logging: DefaultTableExtraction(verbose=True)
  4. Review the Table Extraction Documentation
  5. Check examples

Summary

  • ✅ Full backward compatibility - No code changes required
  • ✅ Same results - Identical extraction behavior by default
  • ✅ New options - Additional control when needed
  • ✅ Better architecture - Consistent with Crawl4AI patterns
  • ✅ Ready for future - Foundation for advanced strategies

The migration to v0.7.3 is seamless with no required changes while providing new capabilities for those who need them.
