webclaw/crates
devnen 95a6681b02 fix: detect layout tables and render as sections instead of markdown tables
Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).

Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.

Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
  (p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
  lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
  instead of wrapping in **/**/* (fixes Drudge's <b><font>...</font></b>
  column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
2026-04-03 22:24:35 +02:00
..
webclaw-cli feat: CLI --research flag + MCP cloud fallback + structured research output 2026-04-03 14:04:04 +02:00
webclaw-core fix: detect layout tables and render as sections instead of markdown tables 2026-04-03 22:24:35 +02:00
webclaw-fetch style: cargo fmt 2026-04-01 18:25:40 +02:00
webclaw-llm Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-mcp fix: MCP research saves to file, returns compact response 2026-04-03 16:05:45 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00