webclaw/crates/webclaw-fetch/src/locale.rs
webclaw 02302e7a1d perf(core): hot-path extraction speedups + senior-grade hardening
Extraction ~22% faster on the corpus benchmark with byte-identical output:
- hoist recompiled CSS selectors in the markdown noise path
- single-pass shared og() meta parsing across vertical extractors
- output-safe QuickJS gating (skip the JS VM when no candidate data) +
  reuse the already-parsed document instead of re-parsing
- wreq connect_timeout + connection-pool tuning; dedup the retry loop

Reliability + correctness:
- char-boundary-safe truncation of LLM error bodies (shared helper)
- HTTP connect/read timeouts on all LLM provider clients
- isolate pdf-extract behind catch_unwind + spawn_blocking
- OSS server: crawl inherits the shared fetch profile; ProviderChain built
  once in AppState; request TimeoutLayer

API / safety / docs:
- #[non_exhaustive] on public enums + result structs (+ builders)
- #![forbid(unsafe_code)] on pure crates, deny on llm
- //! crate docs + doctests; scrub bypass/vendor/target specifics from
  public crate docs and comments

Tooling: [profile.release] lto/codegen-units/strip, MSRV pin, deny.toml +
cargo-deny CI, macOS test matrix. CLI main.rs split into focused modules.
2026-06-04 20:22:00 +02:00


//! Derive an `Accept-Language` header from a URL.
//!
//! Some bot-detection systems on country-specific sites do a geo-vs-locale
//! sanity check: an IP in the target country + a browser UA but the wrong
//! `Accept-Language` is a bot signal. Matching the site's expected locale
//! avoids that mismatch.
//!
//! Unmapped TLDs default to `en-US,en;q=0.9`, the global fallback.

/// Best-effort `Accept-Language` header value for the given URL's TLD.
/// Returns `None` if the URL cannot be parsed or has no host (for example,
/// a `mailto:` URL).
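///
/// # Examples
///
/// Only the last host label is consulted, so a multi-label suffix such as
/// `co.uk` resolves through `uk`. The following is a minimal, std-only
/// sketch of that step; it mirrors the `rsplit('.')` call in this function
/// rather than calling it, and the `host` value is illustrative:
///
/// ```rust
/// // Illustrative host, already lowercased as this function does.
/// let host = "www.example.co.uk";
/// // `rsplit` yields labels right to left, so the first item is the TLD.
/// let tld = host.rsplit('.').next().unwrap_or(host);
/// assert_eq!(tld, "uk");
/// ```
///
/// An IP-literal host such as `127.0.0.1` yields the label `1`, which is
/// unmapped and therefore takes the `en-US` fallback.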
pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
    let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
    let tld = host.rsplit('.').next()?;
    Some(accept_language_for_tld(tld))
}

/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
/// The match is case-sensitive, so pass the TLD in lowercase.
/// Unknown TLDs fall back to US English.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        "de" | "at" => "de-DE,de;q=0.9",
        "es" => "es-ES,es;q=0.9",
        "pt" => "pt-PT,pt;q=0.9",
        "nl" => "nl-NL,nl;q=0.9",
        "pl" => "pl-PL,pl;q=0.9",
        "se" => "sv-SE,sv;q=0.9",
        "no" => "nb-NO,nb;q=0.9",
        "dk" => "da-DK,da;q=0.9",
        "fi" => "fi-FI,fi;q=0.9",
        "cz" => "cs-CZ,cs;q=0.9",
        "ro" => "ro-RO,ro;q=0.9",
        "gr" => "el-GR,el;q=0.9",
        "tr" => "tr-TR,tr;q=0.9",
        "ru" => "ru-RU,ru;q=0.9",
        "jp" => "ja-JP,ja;q=0.9",
        "kr" => "ko-KR,ko;q=0.9",
        "cn" => "zh-CN,zh;q=0.9",
        "tw" | "hk" => "zh-TW,zh;q=0.9",
        "br" => "pt-BR,pt;q=0.9",
        "mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
        "uk" | "ie" => "en-GB,en;q=0.9",
        _ => "en-US,en;q=0.9",
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn tld_dispatch() {
        assert_eq!(
            accept_language_for_url("https://www.example.it/page/1"),
            Some("it-IT,it;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://www.example.fr/"),
            Some("fr-FR,fr;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://www.example.co.uk/"),
            Some("en-GB,en;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://example.com/"),
            Some("en-US,en;q=0.9")
        );
    }

    #[test]
    fn bad_url_returns_none() {
        assert_eq!(accept_language_for_url("not-a-url"), None);
    }
}