feat(map): layered URL discovery with crawl fallback

map now falls back to a bounded same-origin crawl when a site has no
sitemap or only a thin one, harvesting links from each fetched page (the
rich source).

Also adds gzip (.xml.gz) sitemap support, deeper sitemap-index recursion,
and more fallback paths. Results are uncapped by default, with an optional
--map-limit / --map-pages. Crawler logs are routed to stderr so that
--map -f json stays machine-parseable.
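
The layering described above can be sketched roughly as follows. This is a minimal std-only illustration, not the code in this commit: `origin`, `crawl_fallback`, `discover`, `thin_threshold`, `max_pages`, and the `fetch_links` callback are all hypothetical names. Merging sitemap URLs with crawl results is also an assumption, as is the reading of "bounded" as a cap on fetched pages.

```rust
use std::collections::{HashSet, VecDeque};

/// Returns "scheme://host[:port]" for an absolute URL, or None otherwise.
/// (Hypothetical helper; a real implementation would likely use a URL parser.)
fn origin(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")?;
    let rest = &url[scheme_end + 3..];
    let host_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    if host_len == 0 {
        return None;
    }
    Some(&url[..scheme_end + 3 + host_len])
}

/// Breadth-first crawl from `seed`. Fetches at most `max_pages` pages and
/// keeps only links whose origin matches the seed's origin.
/// `fetch_links` stands in for "download the page and extract its links".
fn crawl_fallback<F>(seed: &str, max_pages: usize, mut fetch_links: F) -> Vec<String>
where
    F: FnMut(&str) -> Vec<String>,
{
    let Some(site) = origin(seed) else {
        return Vec::new();
    };
    let site = site.to_string();

    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<String> = VecDeque::new();
    let mut found: Vec<String> = Vec::new();

    seen.insert(seed.to_string());
    found.push(seed.to_string());
    queue.push_back(seed.to_string());

    let mut fetched = 0;
    while let Some(page) = queue.pop_front() {
        if fetched >= max_pages {
            break; // page budget exhausted
        }
        fetched += 1;
        for link in fetch_links(&page) {
            if origin(&link) != Some(site.as_str()) {
                continue; // skip cross-origin and relative links
            }
            if seen.insert(link.clone()) {
                found.push(link.clone());
                queue.push_back(link);
            }
        }
    }
    found
}

/// Layered discovery: trust the sitemap when it has at least
/// `thin_threshold` URLs, otherwise add crawl results to whatever it had.
fn discover<F>(
    seed: &str,
    sitemap_urls: Vec<String>,
    thin_threshold: usize,
    max_pages: usize,
    fetch_links: F,
) -> Vec<String>
where
    F: FnMut(&str) -> Vec<String>,
{
    if sitemap_urls.len() >= thin_threshold {
        return sitemap_urls;
    }
    let mut seen: HashSet<String> = sitemap_urls.iter().cloned().collect();
    let mut all = sitemap_urls;
    for url in crawl_fallback(seed, max_pages, fetch_links) {
        if seen.insert(url.clone()) {
            all.push(url);
        }
    }
    all
}
```

Passing the fetcher as a closure keeps the sketch free of network dependencies: a caller could supply a stub that returns canned links for testing, or a real HTTP fetcher in production.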
webclaw 2026-06-06 12:08:26 +02:00
parent 02302e7a1d
commit b7bd1155c6
10 changed files with 478 additions and 12 deletions

1
Cargo.lock generated

@@ -3263,6 +3263,7 @@ dependencies = [
  "async-trait",
  "bytes",
  "calamine",
+ "flate2",
  "http",
  "quick-xml 0.37.5",
  "rand 0.8.5",