Fit Markdown with Pruning & BM25
Fit Markdown is a specialized filtered version of your page’s markdown, focusing on the most relevant content. By default, Crawl4AI converts the entire HTML into a broad raw_markdown. With fit markdown, we apply a content filter algorithm (e.g., Pruning or BM25) to remove or rank low-value sections—such as repetitive sidebars, shallow text blocks, or irrelevancies—leaving a concise textual “core.”
1. How “Fit Markdown” Works
1.1 The content_filter
In CrawlerRunConfig, you can specify a content_filter to shape how content is pruned or ranked before final markdown generation. A filter’s logic is applied before or during the HTML→Markdown process, producing:
- result.markdown_v2.raw_markdown (unfiltered)
- result.markdown_v2.fit_markdown (filtered or “fit” version)
- result.markdown_v2.fit_html (the corresponding HTML snippet that produced fit_markdown)
Note: We’re currently storing the result in markdown_v2, but eventually we’ll unify it as result.markdown.
1.2 Common Filters
1. PruningContentFilter – Scores each node by text density, link density, and tag importance, discarding those below a threshold.
2. BM25ContentFilter – Focuses on textual relevance using BM25 ranking, especially useful if you have a specific user query (e.g., “machine learning” or “food nutrition”).
2. PruningContentFilter
Pruning discards less relevant nodes based on text density, link density, and tag importance. It’s a heuristic-based approach—if certain sections appear too “thin” or too “spammy,” they’re pruned.
2.1 Usage Example
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Step 1: Create a pruning filter
    prune_filter = PruningContentFilter(
        # Lower → more content retained, higher → more content pruned
        threshold=0.45,
        # "fixed" or "dynamic"
        threshold_type="dynamic",
        # Ignore nodes with <5 words
        min_word_threshold=5
    )

    # Step 2: Insert it into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)

    # Step 3: Pass it to CrawlerRunConfig
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        if result.success:
            # 'fit_markdown' is your pruned content, focusing on "denser" text
            print("Raw Markdown length:", len(result.markdown_v2.raw_markdown))
            print("Fit Markdown length:", len(result.markdown_v2.fit_markdown))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
2.2 Key Parameters
- min_word_threshold (int): If a block has fewer words than this, it’s pruned.
- threshold_type (str):
  - "fixed" → each node must exceed threshold (0–1).
  - "dynamic" → node scoring adjusts according to tag type, text/link density, etc.
- threshold (float, default ~0.48): The base or “anchor” cutoff.
Algorithmic Factors:
- Text density – Encourages blocks that have a higher ratio of text to overall content.
- Link density – Penalizes sections that are mostly links.
- Tag importance – e.g., an <article> or <p> might be more important than a <div>.
- Structural context – If a node is deeply nested or in a suspected sidebar, it might be deprioritized.
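To make the factors above concrete, here is a minimal, self-contained sketch of a pruning-style score. It is not Crawl4AI's actual formula: the `Block` type, the `TAG_WEIGHTS` values, and the way the three factors are multiplied together are all assumptions chosen for illustration, and only the "fixed" threshold mode is modeled.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: str        # e.g. "p", "div", "article"
    html: str       # raw HTML of the block
    text: str       # visible text inside the block
    link_text: str  # the portion of the text that sits inside <a> tags


# Hypothetical tag weights for this sketch; not the library's real values.
TAG_WEIGHTS = {"article": 1.0, "p": 0.9, "div": 0.5}


def score(block: Block) -> float:
    """Combine text density, link density, and tag importance into one number."""
    if not block.html or not block.text:
        return 0.0
    text_density = len(block.text) / len(block.html)
    link_density = min(1.0, len(block.link_text) / len(block.text))
    tag_weight = TAG_WEIGHTS.get(block.tag, 0.5)
    # Dense text scores high; link-heavy text is pushed toward zero.
    return text_density * (1.0 - link_density) * tag_weight


def prune(blocks, threshold=0.48, min_word_threshold=5):
    """Keep blocks that have enough words and score at or above the threshold."""
    kept = []
    for block in blocks:
        if len(block.text.split()) < min_word_threshold:
            continue  # too short to be useful
        if score(block) >= threshold:
            kept.append(block)
    return kept


main_text = "Pruning keeps dense paragraphs that carry most of the page content."
blocks = [
    Block("p", f"<p>{main_text}</p>", main_text, ""),
    Block(
        "div",
        '<div><a href="/">Home</a> <a href="/about">About this site and team</a></div>',
        "Home About this site and team",
        "HomeAbout this site and team",
    ),
    Block("p", "<p>Read more</p>", "Read more", ""),
]
kept = prune(blocks)
print([block.text for block in kept])
```

In this sketch the navigation block is dropped because nearly all of its text is link text, and the "Read more" block is dropped by the word-count check before it is ever scored.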
3. BM25ContentFilter
BM25 is a classical text ranking algorithm often used in search engines. If you have a user query or rely on page metadata to derive a query, BM25 can identify which text chunks best match that query.
3.1 Usage Example
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # 1) A BM25 filter with a user query
    bm25_filter = BM25ContentFilter(
        user_query="startup fundraising tips",
        # Adjust for stricter or looser results
        bm25_threshold=1.2
    )

    # 2) Insert into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)

    # 3) Pass to crawler config
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        if result.success:
            print("Fit Markdown (BM25 query-based):")
            print(result.markdown_v2.fit_markdown)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
3.2 Parameters
- user_query (str, optional): E.g. "machine learning". If blank, the filter tries to glean a query from page metadata.
- bm25_threshold (float, default 1.0):
  - Higher → fewer chunks but more relevant.
  - Lower → more inclusive.
In more advanced scenarios, you might see parameters like use_stemming, case_sensitive, or priority_tags to refine how text is tokenized or weighted.
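To show how a threshold interacts with BM25 scores, the following sketch implements a plain Okapi BM25 scorer over a handful of text chunks using only the standard library. It is an illustration of the ranking idea, not the code inside BM25ContentFilter: the helper names (`tokenize`, `bm25_scores`, `filter_chunks`), the whitespace tokenizer, and the `k1`/`b` values are assumptions, and the library's score scale may differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer for the sketch: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per chunk for the given query."""
    docs = [tokenize(chunk) for chunk in chunks]
    n = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n if n else 0.0

    # Document frequency: in how many chunks does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            total += idf * tf[term] * (k1 + 1) / norm
        scores.append(total)
    return scores


def filter_chunks(query, chunks, threshold=1.0):
    """Keep chunks whose BM25 score is at or above the threshold."""
    scores = bm25_scores(query, chunks)
    return [chunk for chunk, s in zip(chunks, scores) if s >= threshold]


query = "startup fundraising tips"
chunks = [
    "startup fundraising tips for first time founders",
    "weather forecast for the weekend",
    "how to bake sourdough bread at home",
]
print(filter_chunks(query, chunks, threshold=1.0))
```

Raising the threshold shrinks the result toward the best-matching chunks, while a threshold of 0 keeps everything, which mirrors the "higher → fewer, lower → more inclusive" behavior described above.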
4. Accessing the “Fit” Output
After the crawl, your “fit” content is found in result.markdown_v2.fit_markdown. In future versions, it will be result.markdown.fit_markdown. Meanwhile:
- If the content filter is BM25, you might see additional logic or references in fit_markdown that highlight relevant segments.
- If it’s Pruning, the text is typically well-cleaned but not necessarily matched to a query.
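Because the attribute holding the fit output is expected to move from markdown_v2 to markdown, calling code may want a small accessor that tolerates both layouts. The helper below is a hypothetical convenience function, not part of Crawl4AI; it is exercised here with stand-in objects built from `SimpleNamespace` rather than real crawl results.

```python
from types import SimpleNamespace


def get_fit_markdown(result):
    """Return fit markdown from either attribute layout, or None if absent."""
    for attr in ("markdown_v2", "markdown"):
        container = getattr(result, attr, None)
        # A plain string (or None) has no fit_markdown attribute, so this yields None.
        fit = getattr(container, "fit_markdown", None)
        if fit:
            return fit
    return None


# Stand-ins for the current and the announced future result shapes.
current = SimpleNamespace(
    markdown_v2=SimpleNamespace(fit_markdown="core text"),
    markdown="raw text",
)
future = SimpleNamespace(markdown=SimpleNamespace(fit_markdown="core text"))

print(get_fit_markdown(current))
print(get_fit_markdown(future))
```

Checking markdown_v2 first keeps the helper aligned with the current layout, and the fallback only applies once the unified attribute exists.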
5. Code Patterns Recap
5.1 Pruning
prune_filter = PruningContentFilter(
    threshold=0.5,
    threshold_type="fixed",
    min_word_threshold=10
)
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
# => result.markdown_v2.fit_markdown
5.2 BM25
bm25_filter = BM25ContentFilter(
    user_query="health benefits fruit",
    bm25_threshold=1.2
)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
# => result.markdown_v2.fit_markdown
6. Combining with “word_count_threshold” & Exclusions
Remember you can also specify:
config = CrawlerRunConfig(
    word_count_threshold=10,
    excluded_tags=["nav", "footer", "header"],
    exclude_external_links=True,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)
    )
)
Thus, multi-level filtering occurs:
- The crawler’s excluded_tags are removed from the HTML first.
- The content filter (Pruning, BM25, or custom) prunes or ranks the remaining text blocks.
- The final “fit” content is generated in result.markdown_v2.fit_markdown.
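The ordering of these stages can be sketched with the standard library alone. The example below is a simplified model, not Crawl4AI's pipeline: the `ParagraphExtractor` class and `fit_blocks` function are hypothetical, only `<p>` elements are treated as blocks, and a plain word-count check stands in for the content filter.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects <p> text, skipping anything nested inside excluded tags."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded_tags = set(excluded_tags)
        self.skip_depth = 0  # > 0 while inside an excluded tag
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded_tags:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in self.excluded_tags and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.paragraphs[-1] += data


def fit_blocks(html, excluded_tags, word_count_threshold):
    # Stage 1: drop everything inside the excluded tags.
    parser = ParagraphExtractor(excluded_tags)
    parser.feed(html)
    parser.close()
    # Stage 2: stand-in content filter, here a simple word-count check.
    return [
        p.strip()
        for p in parser.paragraphs
        if len(p.split()) >= word_count_threshold
    ]


html = """
<nav><p>Home About Contact Blog Careers Press</p></nav>
<p>Fit markdown keeps the dense, relevant paragraphs of a page.</p>
<p>Read more</p>
<footer><p>Copyright notice and legal links live here too.</p></footer>
"""
print(fit_blocks(html, ["nav", "footer", "header"], 5))
```

Without the tag exclusions, the navigation and footer paragraphs would pass the word-count check on their own, which is why removing structural boilerplate first makes the later filter's job easier.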
7. Custom Filters
If you need a different approach (like a specialized ML model or site-specific heuristics), you can create a new class inheriting from RelevantContentFilter and implement filter_content(html). Then inject it into your markdown generator:
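from bs4 import BeautifulSoup
from crawl4ai.content_filter_strategy import RelevantContentFilter

class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html, min_word_threshold=None):
        # Parse the HTML, then keep only the blocks that pass your own check.
        # The 20-word paragraph rule below is a placeholder, not library behavior.
        soup = BeautifulSoup(html, "html.parser")
        paragraphs = soup.find_all("p")
        return [str(p) for p in paragraphs if len(p.get_text().split()) >= 20]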
Steps:
- Subclass RelevantContentFilter.
- Implement filter_content(...).
- Use it in your DefaultMarkdownGenerator(content_filter=MyCustomFilter(...)).
8. Final Thoughts
Fit Markdown is a crucial feature for:
- Summaries: Quickly get the important text from a cluttered page.
- Search: Combine with BM25 to produce content relevant to a query.
- AI Pipelines: Filter out boilerplate so LLM-based extraction or summarization runs on denser text.
Key Points:
- PruningContentFilter: Great if you just want the “meatiest” text without a user query.
- BM25ContentFilter: Perfect for query-based extraction or searching.
- Combine with excluded_tags, exclude_external_links, word_count_threshold to refine your final “fit” text.
- Fit markdown ends up in result.markdown_v2.fit_markdown; eventually result.markdown.fit_markdown in future versions.
With these tools, you can zero in on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant “fit markdown” for your AI or data pipelines. Happy pruning and searching!
- Last Updated: 2025-01-01