Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).
Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.
Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
(p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
instead of wrapping in **/**/* (fixes Drudge's <b><font>...</font></b>
column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.
Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Embeds QuickJS (rquickjs) to execute inline <script> tags and extract
data hidden in JavaScript variable assignments. Captures window.__*
objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired),
and self.__next_f (Next.js RSC flight data).
Results:
- NYTimes: 1,552 → 4,162 words (+168%)
- Wired: 1,459 → 9,937 words (+580%)
- Zero measurable performance overhead (<15ms per page)
- Feature-gated: disable with --no-default-features for WASM
Smart text filtering rejects CSS, base64, file paths, code strings.
Only readable prose is appended under "## Additional Content".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)
CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax
MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>