mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-22 02:38:06 +02:00
feat(map): layered URL discovery with bounded crawl fallback
Rescued from the stale perf/audit-fixes branch and ported cleanly onto current main (fetch + CLI only — the original commit never touched the server/MCP map surfaces). `--map` used to return only what a site advertises in sitemap.xml, which is nothing for sites with no sitemap (e.g. Hacker News) or a thin one. Now discovery is layered: - webclaw-fetch::discover_urls() / MapOptions — sitemaps first (authoritative, carries lastmod/priority/changefreq); when the sitemap is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded same-origin crawl and harvest links from every fetched page plus the unfetched frontier, deduped against the sitemap set. - sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() + FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper index recursion (3->5); 4 more fallback paths. - CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to stderr so `--map -f json` stays machine-parseable. One new dependency: flate2 (already resolved in the lockfile transitively). Includes the commit's unit tests (map dedup/origin, gzip decode). Original work by the prior author on perf/audit-fixes; this re-applies only the map slice onto main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
c3e5ef5143
commit
179efbcf87
8 changed files with 485 additions and 12 deletions
1
Cargo.lock
generated
1
Cargo.lock
generated
|
|
@ -3265,6 +3265,7 @@ dependencies = [
|
|||
"async-trait",
|
||||
"bytes",
|
||||
"calamine",
|
||||
"flate2",
|
||||
"futures-util",
|
||||
"http",
|
||||
"quick-xml 0.37.5",
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue