feat(core): --mode sections for nav-URL discovery

Section-URL ambiguity is a recurring source of friction: callers have to
guess whether to hit infobae.com root (LATAM frontpage) or /economia/
(AR-specific live FX dashboard), or decrypt.co root (ticker ribbon) vs
/news/ (article list), or bbc.com/news/world vs /news/world/europe/.
Each guess costs a round-trip.

New `--mode sections` returns the discoverable section URLs parsed
from the page's nav, in one round-trip. Subsumes issue #16 (non-English
nav is harder to LLM-parse; sections now come back as data, not prose).

Multi-signal heuristic on the existing link extraction:
URL-pattern match (/<category>/ style short paths), repetition
(section links appear in header + footer), DOM-position when
available. Fallback when zero sections detected: emit top-N links
with a "(none detected; first N shown)" note.
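
A minimal sketch of how such a multi-signal pass could look, for
illustration only. The `Link` struct, `looks_like_section_path`,
`pick_sections`, the two-of-three threshold, and the segment limits are
all assumptions here, not the code in `sections.rs`:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical link record; the real extraction type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Link {
    pub label: String,
    pub url: String,
    /// DOM-position signal; `None` when position is unavailable.
    pub in_nav: Option<bool>,
}

/// Signal 1: short category-style path such as `/economia/`.
/// The limits (<= 3 segments, <= 24 chars each) are assumed values.
pub fn looks_like_section_path(url: &str) -> bool {
    // Strip scheme and host when present.
    let path = match url.find("://") {
        Some(i) => {
            let rest = &url[i + 3..];
            match rest.find('/') {
                Some(j) => &rest[j..],
                None => "/",
            }
        }
        None => url,
    };
    // Ignore query string and fragment.
    let path = path.split(|c| c == '?' || c == '#').next().unwrap_or("");
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    !segments.is_empty()
        && segments.len() <= 3
        && segments
            .iter()
            .all(|s| s.len() <= 24 && s.chars().all(|c| c.is_alphabetic() || c == '-'))
}

/// Returns (links, fallback_used). A link counts as a section when at
/// least two of the three signals agree (assumed threshold).
pub fn pick_sections(links: &[Link], top_n: usize) -> (Vec<Link>, bool) {
    // Signal 2: repetition, e.g. the same URL in header and footer.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for link in links {
        *counts.entry(link.url.as_str()).or_insert(0) += 1;
    }

    let mut seen: HashSet<&str> = HashSet::new();
    let mut sections = Vec::new();
    for link in links {
        if !seen.insert(link.url.as_str()) {
            continue; // keep the first occurrence only
        }
        let mut score = 0;
        if looks_like_section_path(&link.url) {
            score += 1;
        }
        if counts.get(link.url.as_str()).copied().unwrap_or(0) >= 2 {
            score += 1;
        }
        // Signal 3: DOM position, only when available.
        if link.in_nav == Some(true) {
            score += 1;
        }
        if score >= 2 {
            sections.push(link.clone());
        }
    }

    if sections.is_empty() {
        // Fallback: first N distinct links, flagged so the caller can
        // attach the "(none detected; first N shown)" note.
        let mut seen: HashSet<&str> = HashSet::new();
        let mut first = Vec::new();
        for link in links {
            if first.len() == top_n {
                break;
            }
            if seen.insert(link.url.as_str()) {
                first.push(link.clone());
            }
        }
        return (first, true);
    }
    (sections, false)
}
```

Returning the fallback flag separately, rather than baking the note into
the labels, leaves the wording of the note to each output format.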

Format: -f llm/text emits `Sections:` followed by `- [Label](url)`
list. -f json emits `{"sections": [{"label": "...", "url": "..."}]}`.
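
The two output shapes could be rendered roughly as below. This is a
sketch under assumptions: `render_sections_text`, `render_sections_json`
and `json_escape` are hypothetical names, sections are passed as plain
`(label, url)` pairs, and the JSON is built by hand only to keep the
example dependency-free. How the fallback note is attached is not
specified above, so it is left out:

```rust
/// Escape a string for embedding in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// llm/text shape: a `Sections:` header followed by `- [Label](url)` lines.
pub fn render_sections_text(sections: &[(&str, &str)]) -> String {
    let mut out = String::from("Sections:\n");
    for (label, url) in sections {
        out.push_str(&format!("- [{}]({})\n", label, url));
    }
    out
}

/// json shape: `{"sections": [{"label": "...", "url": "..."}]}`.
pub fn render_sections_json(sections: &[(&str, &str)]) -> String {
    let items: Vec<String> = sections
        .iter()
        .map(|(label, url)| {
            format!(
                "{{\"label\": \"{}\", \"url\": \"{}\"}}",
                json_escape(label),
                json_escape(url)
            )
        })
        .collect();
    format!("{{\"sections\": [{}]}}", items.join(", "))
}
```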

13 new tests in webclaw-core (688 -> 701).
devnen 2026-05-23 23:14:40 +02:00
parent 76cd515a3e
commit ade2a5143c
4 changed files with 542 additions and 6 deletions


@@ -11,9 +11,11 @@ mod images;
mod links;
mod metadata;
mod output_size;
mod sections;
mod thin_body;
pub use hub_detect::{classify as classify_hub, HubClassification};
pub use sections::{collect_section_links, to_json_sections, to_llm_sections};
pub use thin_body::{classify as classify_thin_body, ThinBodyClassification};
pub use output_size::{
to_json_summary, to_json_toc, to_llm_summary, to_llm_toc, truncate_json_with_wrapper,