webclaw/crates/webclaw-fetch/Cargo.toml
Valerio 179efbcf87 feat(map): layered URL discovery with bounded crawl fallback
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main (fetch + CLI only — the original commit never touched the
server/MCP map surfaces).

`--map` used to return only what a site advertises in sitemap.xml, which
is nothing for sites with no sitemap (e.g. Hacker News) or a thin one.
Now discovery is layered:

- webclaw-fetch::discover_urls() / MapOptions — sitemaps first
  (authoritative, carries lastmod/priority/changefreq); when the sitemap
  is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded
  same-origin crawl and harvest links from every fetched page plus the
  unfetched frontier, deduped against the sitemap set.
- sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() +
  FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper index
  recursion (3->5); 4 more fallback paths.
- CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to
  stderr so `--map -f json` stays machine-parseable.

One new dependency: flate2 (already resolved in the lockfile transitively).
Includes the commit's unit tests (map dedup/origin, gzip decode). Original
work by the prior author on perf/audit-fixes; this re-applies only the map
slice onto main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:33:49 +02:00

38 lines
1.3 KiB
TOML

[package]
name = "webclaw-fetch"
description = "HTTP client with browser TLS fingerprint impersonation via wreq"
version.workspace = true
edition.workspace = true
license.workspace = true
[dependencies]
webclaw-core = { workspace = true }
webclaw-pdf = { path = "../webclaw-pdf" }
serde = { workspace = true }
thiserror = { workspace = true }
tracing = { workspace = true }
tokio = { workspace = true }
async-trait = "0.1"
# Pinned to exact pre-release versions: wreq/wreq-util are release candidates
# with no semver stability between rc.N builds. An exact pin keeps `cargo build`,
# `cargo install` (which ignores Cargo.lock), and the release workflow all on the
# version that compiles.
wreq = { version = "=6.0.0-rc.29", features = ["cookies", "gzip", "brotli", "zstd", "deflate", "stream"] }
wreq-util = "=3.0.0-rc.12"
http = "1"
bytes = "1"
# Stream adapter for `wreq::Response::bytes_stream()` (wreq 6.0.0-rc.29 dropped
# `Response::chunk()`); used to buffer bodies under the running size ceiling.
futures-util = "0.3"
url = "2"
rand = "0.8"
quick-xml = { version = "0.37", features = ["serde"] }
regex = "1"
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
serde_json.workspace = true
calamine = "0.34"
zip = "2"
flate2 = "1"
[dev-dependencies]
tempfile = "3"