Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main (fetch + CLI only — the original commit never touched the
server/MCP map surfaces).
`--map` used to return only what a site advertises in sitemap.xml: nothing
for sites with no sitemap (e.g. Hacker News), and very little for sites
with a thin one.
Now discovery is layered:
- webclaw-fetch::discover_urls() / MapOptions — sitemaps first
(authoritative, carries lastmod/priority/changefreq); when the sitemap
is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded
same-origin crawl and harvest links from every fetched page plus the
unfetched frontier, deduped against the sitemap set.
- sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() +
FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper index
recursion (3->5); 4 more fallback paths.
- CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to
stderr so `--map -f json` stays machine-parseable.
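
A rough sketch of the layering described above, not the code in this
commit. The struct fields, the `layered_discover` and `origin` helpers, and
the closure standing in for the bounded crawl are all hypothetical; only the
idea (sitemap first, same-origin crawl as a deduped fallback when the
sitemap is thin) comes from the description.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the real MapOptions; field names are assumptions.
struct MapOptions {
    min_sitemap_urls: usize,
    crawl_fallback: bool,
}

/// Crude `scheme://host[:port]` extraction. A real implementation would use
/// a proper URL parser; this is only enough for the illustration.
fn origin(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")? + 3;
    let rest = &url[scheme_end..];
    let host_end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    if host_end == 0 {
        return None;
    }
    Some(&url[..scheme_end + host_end])
}

/// Sitemap URLs come first. The crawl closure is invoked only when the
/// sitemap is thin and the fallback is enabled; its results are filtered to
/// the seed's origin and deduped against everything already collected.
fn layered_discover(
    seed: &str,
    sitemap_urls: Vec<String>,
    opts: &MapOptions,
    crawl: impl FnOnce() -> Vec<String>,
) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut out: Vec<String> = Vec::new();

    for url in sitemap_urls {
        if seen.insert(url.clone()) {
            out.push(url);
        }
    }

    if out.len() < opts.min_sitemap_urls && opts.crawl_fallback {
        if let Some(seed_origin) = origin(seed) {
            for url in crawl() {
                if origin(&url) == Some(seed_origin) && seen.insert(url.clone()) {
                    out.push(url);
                }
            }
        }
    }
    out
}
```

Passing the crawl as a closure keeps the sketch free of network code and
makes it obvious that the crawl never runs when the sitemap is large enough
or the fallback is disabled.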
One new direct dependency: flate2 (already present in the lockfile as a
transitive dependency).
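
To illustrate why the gzip path needs raw bytes rather than a lossily
decoded string, here is a small stdlib-only sketch. Both helper names are
hypothetical and this is not the body of decode_sitemap_body(); the actual
decompression (done with flate2 in this change) is left out. The only
outside fact relied on is that gzip streams begin with the bytes 0x1f 0x8b.

```rust
/// Gzip streams start with these two magic bytes (RFC 1952).
const GZIP_MAGIC: [u8; 2] = [0x1f, 0x8b];

/// Hypothetical helper: sniff whether a fetched sitemap body is gzipped.
/// Sniffing the body (instead of trusting a `.xml.gz` suffix) is an
/// assumption about one reasonable approach, not a claim about the commit.
fn looks_gzipped(body: &[u8]) -> bool {
    body.starts_with(&GZIP_MAGIC)
}

/// Returns true when a lossy UTF-8 round trip leaves the bytes unchanged.
/// Compressed data generally fails this, because invalid sequences are
/// replaced with U+FFFD and the original bytes are lost.
fn survives_lossy_utf8(body: &[u8]) -> bool {
    String::from_utf8_lossy(body).as_bytes() == body
}
```

Plain XML passes through a lossy conversion untouched, while a gzip header
does not (0x8b is not a valid UTF-8 lead byte), which is the motivation for
fetching sitemap bodies as bytes.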
Includes the original commit's unit tests (map dedup/origin, gzip decode). Original
work by the prior author on perf/audit-fixes; this re-applies only the map
slice onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>