webclaw/crates
Valerio 179efbcf87 feat(map): layered URL discovery with bounded crawl fallback
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main (fetch + CLI only — the original commit never touched the
server/MCP map surfaces).

`--map` used to return only what a site advertises in sitemap.xml, which
is nothing for sites with no sitemap (e.g. Hacker News) and very little
for sites with a thin one.
Now discovery is layered:

- webclaw-fetch::discover_urls() / MapOptions — sitemaps first
  (authoritative, carries lastmod/priority/changefreq); when the sitemap
  is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded
  same-origin crawl and harvest links from every fetched page plus the
  unfetched frontier, deduped against the sitemap set.
- sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() +
  FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper sitemap
  index recursion (limit raised from 3 to 5); 4 more fallback paths.
- CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to
  stderr so `--map -f json` stays machine-parseable.
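
The layering described in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in webclaw-fetch: `merge_discovered`, `origin_of`, and the `crawl_fallback` field are hypothetical names, the struct models only the two knobs mentioned above, and the origin check is a naive string split where the real crate presumably uses a proper URL parser.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the real MapOptions; only the threshold and
/// the fallback switch mentioned in the commit message are modelled.
struct MapOptions {
    min_sitemap_urls: usize,
    crawl_fallback: bool,
}

/// Naive origin extraction (scheme://host[:port]) by string splitting.
/// Assumption for the sketch only; it does not normalise case or ports.
fn origin_of(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")? + 3;
    let rest = &url[scheme_end..];
    let host_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    if host_len == 0 {
        return None;
    }
    Some(&url[..scheme_end + host_len])
}

/// Sitemap URLs come first. If the deduplicated sitemap set is thin and
/// the fallback is enabled, call `crawl_links` (standing in for the
/// bounded crawl) and append same-origin links not already seen.
fn merge_discovered(
    seed: &str,
    sitemap_urls: Vec<String>,
    crawl_links: impl FnOnce() -> Vec<String>,
    opts: &MapOptions,
) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut out: Vec<String> = Vec::new();

    // Layer 1: sitemap entries, deduplicated, order preserved.
    for url in sitemap_urls {
        if seen.insert(url.clone()) {
            out.push(url);
        }
    }

    // A sufficiently large sitemap, or a disabled fallback, ends discovery.
    if out.len() >= opts.min_sitemap_urls || !opts.crawl_fallback {
        return out;
    }

    // Layer 2: crawl results, restricted to the seed's origin and
    // deduplicated against everything collected so far.
    let seed_origin = origin_of(seed);
    for link in crawl_links() {
        let same_origin = seed_origin.is_some() && origin_of(&link) == seed_origin;
        if same_origin && seen.insert(link.clone()) {
            out.push(link);
        }
    }
    out
}
```

Passing the crawl as a closure keeps the sketch honest about cost: nothing is fetched unless the sitemap layer comes up short, which matches the "bounded crawl fallback" wording above.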

One new dependency: flate2 (already resolved in the lockfile transitively).
Includes the commit's unit tests (map dedup/origin, gzip decode). Original
work by the prior author on perf/audit-fixes; this re-applies only the map
slice onto main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:33:49 +02:00
webclaw-cli feat(map): layered URL discovery with bounded crawl fallback 2026-06-17 15:33:49 +02:00
webclaw-core fix: harden LLM providers, UTF-8 handling, and webhook/batch reliability 2026-06-09 21:10:15 +02:00
webclaw-fetch feat(map): layered URL discovery with bounded crawl fallback 2026-06-17 15:33:49 +02:00
webclaw-llm feat(llm): add Gemini provider and fix stale Anthropic default model 2026-06-16 15:52:37 +02:00
webclaw-mcp feat(search): standalone web search via Serper.dev (bring-your-own-key) 2026-06-17 15:10:58 +02:00
webclaw-pdf Initial release: webclaw v0.1.0 — web content extraction for LLMs 2026-03-23 18:31:11 +01:00
webclaw-server feat(search): standalone web search via Serper.dev (bring-your-own-key) 2026-06-17 15:10:58 +02:00