mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-29 03:39:37 +02:00
feat(map): layered URL discovery with crawl fallback
map now falls back to a bounded same-origin crawl when a site has no sitemap or only a thin one, harvesting links from each fetched page (the richer source of URLs). Also adds gzip (.xml.gz) sitemap support, deeper sitemap-index recursion, and more fallback paths; makes results uncapped by default, with an optional --map-limit / --map-pages; and routes crawler logs to stderr so --map -f json stays machine-parseable.
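The fallback described above could look roughly like the sketch below. It is not the code from this commit: `origin`, `crawl_same_origin`, `discover`, the `thin_threshold` cutoff, and the `links_of` callback (a stand-in for fetching a page and extracting its links) are all hypothetical names chosen for illustration, and the way a sitemap is judged "thin" is an assumption. It uses only the standard library.

```rust
use std::collections::{HashSet, VecDeque};

/// Return the origin ("scheme://host[:port]") of an absolute URL, if it has one.
fn origin(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")? + 3;
    let rest = &url[scheme_end..];
    let host_end = rest.find(&['/', '?', '#'][..]).unwrap_or(rest.len());
    if host_end == 0 {
        return None;
    }
    Some(&url[..scheme_end + host_end])
}

/// Bounded breadth-first crawl that only follows links on the seed's origin.
/// `links_of` is a hypothetical stand-in for "fetch the page, harvest its links".
/// At most `max_pages` pages are fetched; every same-origin URL seen is returned
/// in discovery order, starting with the seed.
fn crawl_same_origin<F>(seed: &str, max_pages: usize, mut links_of: F) -> Vec<String>
where
    F: FnMut(&str) -> Vec<String>,
{
    let seed_origin = match origin(seed) {
        Some(o) => o.to_string(),
        None => return Vec::new(),
    };
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<String> = VecDeque::new();
    let mut found: Vec<String> = Vec::new();

    seen.insert(seed.to_string());
    found.push(seed.to_string());
    queue.push_back(seed.to_string());

    let mut fetched = 0;
    while let Some(page) = queue.pop_front() {
        if fetched >= max_pages {
            break; // page budget exhausted
        }
        fetched += 1;
        for link in links_of(&page) {
            // Keep only same-origin links that have not been seen before.
            if origin(&link) == Some(seed_origin.as_str()) && seen.insert(link.clone()) {
                found.push(link.clone());
                queue.push_back(link);
            }
        }
    }
    found
}

/// Layered discovery: use the sitemap URLs when there are enough of them,
/// otherwise top them up with a bounded crawl. `limit = None` means uncapped.
/// The `thin_threshold` rule is an assumption made for this sketch.
fn discover<F>(
    sitemap_urls: Vec<String>,
    seed: &str,
    thin_threshold: usize,
    max_pages: usize,
    limit: Option<usize>,
    links_of: F,
) -> Vec<String>
where
    F: FnMut(&str) -> Vec<String>,
{
    let mut urls = sitemap_urls;
    if urls.len() < thin_threshold {
        let mut seen: HashSet<String> = urls.iter().cloned().collect();
        for u in crawl_same_origin(seed, max_pages, links_of) {
            if seen.insert(u.clone()) {
                urls.push(u);
            }
        }
    }
    if let Some(n) = limit {
        urls.truncate(n);
    }
    urls
}
```

In this sketch the sitemap entries come first and crawled URLs are appended after de-duplication, so an optional cap trims crawl results before sitemap results. Whether the real implementation orders or caps results this way is not stated in the commit message.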
parent 02302e7a1d
commit b7bd1155c6
10 changed files with 478 additions and 12 deletions
1
Cargo.lock
generated
@@ -3263,6 +3263,7 @@ dependencies = [
  "async-trait",
  "bytes",
  "calamine",
+ "flate2",
  "http",
  "quick-xml 0.37.5",
  "rand 0.8.5",