mirror of https://github.com/0xMassi/webclaw.git (synced 2026-06-28 03:29:38 +02:00)
feat(core): --mode sections for nav-URL discovery
Section-URL ambiguity is recurring friction: callers have to guess whether to hit infobae.com root (LATAM frontpage) or /economia/ (AR-specific live FX dashboard), decrypt.co root (ticker ribbon) or /news/ (article list), bbc.com/news/world or /news/world/europe/. Each guess costs a round-trip.

New `--mode sections` returns the discoverable section URLs parsed from the page's nav, in one round-trip. Subsumes issue #16 (non-English nav is harder to LLM-parse; sections come back as data, not prose).

Multi-signal heuristic on the existing link extraction: URL-pattern match (/<category>/ style short paths), repetition (section links appear in both header and footer), and DOM position when available.

Fallback when zero sections are detected: emit the top-N links with a "(none detected; first N shown)" note.

Format: -f llm/text emits `Sections:` followed by a `- [Label](url)` list. -f json emits `{"sections": [{"label": "...", "url": "..."}]}`.

13 new tests in webclaw-core (688 -> 701).
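The heuristic and fallback described above can be sketched roughly as follows. This is an illustration only, not the webclaw-core implementation: `Link`, `looks_like_section_path`, `pick_sections`, and `render_llm` are hypothetical names, the thresholds (at most 3 path segments, at most 24 bytes each) are assumed, and so is the rule that a link must satisfy both the URL-pattern and the repetition signal. The DOM-position signal is left out, and the placement of the fallback note on the `Sections:` line is a guess.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical link record; the real link type in webclaw-core may differ.
#[derive(Debug, Clone, PartialEq)]
struct Link {
    label: String,
    url: String,
}

/// URL-pattern signal: a short path such as `/economia/` or `/news/world/`.
/// The limits used here are assumptions, not values from the commit.
fn looks_like_section_path(url: &str) -> bool {
    // Drop scheme and host if present, keeping only the path.
    let path = match url.find("://") {
        Some(i) => {
            let rest = &url[i + 3..];
            match rest.find('/') {
                Some(j) => &rest[j..],
                None => "/",
            }
        }
        None => url,
    };
    // Ignore query string and fragment.
    let path = path.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    !segments.is_empty()
        && segments.len() <= 3
        && segments
            .iter()
            .all(|s| s.len() <= 24 && s.chars().all(|c| c.is_alphanumeric() || c == '-'))
}

/// Returns the detected sections plus a flag that is true when the
/// top-N fallback was used because nothing was detected.
fn pick_sections(links: &[Link], top_n: usize) -> (Vec<Link>, bool) {
    // Repetition signal: count occurrences of each URL (e.g. header + footer).
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for l in links {
        *counts.entry(l.url.as_str()).or_insert(0) += 1;
    }

    let mut seen: HashSet<&str> = HashSet::new();
    let mut sections = Vec::new();
    for l in links {
        let repeated = counts.get(l.url.as_str()).copied().unwrap_or(0) >= 2;
        // Assumed combination rule: both signals must agree; keep first occurrence.
        if looks_like_section_path(&l.url) && repeated && seen.insert(l.url.as_str()) {
            sections.push(l.clone());
        }
    }

    if sections.is_empty() {
        // Fallback: first N distinct links in document order.
        let mut seen_fb: HashSet<&str> = HashSet::new();
        let firsts: Vec<Link> = links
            .iter()
            .filter(|l| seen_fb.insert(l.url.as_str()))
            .take(top_n)
            .cloned()
            .collect();
        return (firsts, true);
    }
    (sections, false)
}

/// Renders the `-f llm/text` shape: `Sections:` followed by `- [Label](url)` lines.
fn render_llm(sections: &[Link], fallback: bool) -> String {
    let mut out = String::from("Sections:");
    if fallback {
        out.push_str(&format!(" (none detected; first {} shown)", sections.len()));
    }
    out.push('\n');
    for s in sections {
        out.push_str(&format!("- [{}]({})\n", s.label, s.url));
    }
    out
}
```

Given a page where `/economia/` and `/deportes/` each appear twice and a long dated article URL appears once, `pick_sections` in this sketch keeps only the two section links. A bare site root such as `https://decrypt.co` has no path segments, so it is never reported as a section.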
parent 76cd515a3e
commit ade2a5143c
4 changed files with 542 additions and 6 deletions
@@ -11,9 +11,11 @@ mod images;
 mod links;
 mod metadata;
 mod output_size;
+mod sections;
 mod thin_body;

 pub use hub_detect::{classify as classify_hub, HubClassification};
+pub use sections::{collect_section_links, to_json_sections, to_llm_sections};
 pub use thin_body::{classify as classify_thin_body, ThinBodyClassification};
 pub use output_size::{
     to_json_summary, to_json_toc, to_llm_summary, to_llm_toc, truncate_json_with_wrapper,