fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product

Two targeted fixes surfaced by the manual extractor smoke test.

cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string
  "Verifying your connection..." and an `interstitial-spinner` div.
  That pattern was not in our detector, so local fetch returned the
  challenge page, JSON-LD parsing found nothing, and the extractor
  emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern (both markers must be present) plus a size gate
  (`html.len() < 10_000`) so real articles that happen to mention the
  phrase aren't misclassified. Two new tests cover the positive and
  negative cases.
- With the fix, trustpilot_reviews now correctly escalates via
  smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY"
  actionable error without a key, or cloud-bypassed HTML with one.
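
A minimal sketch of the resulting decision, assuming a simplified
flow. `FetchPlan`, `looks_like_waf_interstitial` and `plan_fetch` are
hypothetical names for illustration only; the real logic lives in
`is_bot_protected` and `smart_fetch_html`, whose signatures are not
reproduced here.

```rust
/// Hypothetical outcome type for this sketch (not part of webclaw).
#[derive(Debug, PartialEq)]
enum FetchPlan {
    /// Local HTML looks like real content; use it directly.
    UseLocal,
    /// Challenge page detected and a key is available.
    EscalateToCloud,
    /// Challenge page detected but no key; surface an actionable error.
    NeedApiKey(&'static str),
}

/// Simplified stand-in for the new branch in `is_bot_protected`:
/// both markers must be present and the page must be under 10_000 bytes.
fn looks_like_waf_interstitial(html: &str) -> bool {
    let lower = html.to_lowercase();
    lower.contains("interstitial-spinner")
        && lower.contains("verifying your connection")
        && html.len() < 10_000
}

/// Assumed escalation policy: only escalate when the local fetch
/// returned the interstitial, and only if an API key is configured.
fn plan_fetch(html: &str, api_key: Option<&str>) -> FetchPlan {
    if !looks_like_waf_interstitial(html) {
        return FetchPlan::UseLocal;
    }
    match api_key {
        Some(_) => FetchPlan::EscalateToCloud,
        None => FetchPlan::NeedApiKey("Set WEBCLAW_API_KEY"),
    }
}
```

The size gate is what keeps a long page that merely contains both
strings on the `UseLocal` path.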

ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and
  produced an empty `offers` list when JSON-LD was present but its
  `offers` node was missing. Many sites (Patagonia-style catalog pages,
  smaller Squarespace stores) ship price data in either OG tags or
  JSON-LD, but not in both.
- Added OG meta-tag fallback that handles:
  * no JSON-LD at all -> build minimal payload from og:title,
    og:image, og:description, product:price:amount,
    product:price:currency, product:availability, product:brand
  * JSON-LD present but offers empty -> augment with an OG-derived
    offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback"
  so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag
  so blog posts don't get mis-classified as products.
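
The branch selection above can be sketched roughly as follows. This is
an illustration under stated assumptions, not the code from this
commit: the OG tags are modelled as a plain `HashMap`, the
`data_source` helper and its parameters are hypothetical, and the
behaviour for "JSON-LD without offers and no OG price" is a guess.

```rust
use std::collections::HashMap;

/// Stand-in for `has_og_product_signal()`: true when the page declares
/// og:type=product or carries a price tag. The map-based signature is
/// an assumption made for this sketch.
fn has_og_product_signal(og: &HashMap<&str, &str>) -> bool {
    og.get("og:type").copied() == Some("product")
        || og.contains_key("product:price:amount")
}

/// Hypothetical helper that picks the `data_source` label.
/// Returns None when neither JSON-LD nor a usable OG signal exists.
fn data_source(
    jsonld_found: bool,
    jsonld_has_offers: bool,
    og: &HashMap<&str, &str>,
) -> Option<&'static str> {
    let og_price = og.contains_key("product:price:amount");
    match (jsonld_found, jsonld_has_offers) {
        // JSON-LD already carries offers; nothing to augment.
        (true, true) => Some("jsonld"),
        // JSON-LD present but offers empty: add an OG-derived offer.
        (true, false) if og_price => Some("jsonld+og"),
        // Assumed: no OG price to add, so report JSON-LD as-is.
        (true, false) => Some("jsonld"),
        // No JSON-LD: fall back to OG only if it looks like a product.
        (false, _) if has_og_product_signal(og) => Some("og_fallback"),
        (false, _) => None,
    }
}
```

A page with `og:type=article` and no price tag yields `None` here,
which matches the intent that blog posts are not treated as products.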

Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
Valerio 2026-04-22 17:07:31 +02:00
parent 7f5eb93b65
commit a53578e45c
2 changed files with 299 additions and 28 deletions

@@ -325,6 +325,18 @@ pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
        return true;
    }

    // AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
    // Distinct from the captcha-branded path above: the challenge page is
    // a tiny HTML shell with an `interstitial-spinner` div and no content.
    // Gating on html.len() keeps false-positives off long pages that
    // happen to mention the phrase in an unrelated context.
    if html_lower.contains("interstitial-spinner")
        && html_lower.contains("verifying your connection")
        && html.len() < 10_000
    {
        return true;
    }

    // hCaptcha *blocking* page (not just an embedded widget).
    if html_lower.contains("hcaptcha.com")
        && html_lower.contains("h-captcha")
@@ -564,6 +576,26 @@ mod tests {
        assert!(!is_bot_protected(&html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_detects_aws_waf_verifying_connection() {
        // The exact shape Trustpilot serves under AWS WAF.
        let html = r#"<div class="container"><div id="loading-state">
            <div class="interstitial-spinner" id="spinner"></div>
            <h1>Verifying your connection...</h1></div></div>"#;
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_ignores_phrase_on_real_content() {
        // A real article that happens to mention the phrase in prose
        // should not trigger the short-page detector.
        let html = format!(
            "<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
            "article text ".repeat(2_000)
        );
        assert!(!is_bot_protected(&html, &empty_headers()));
    }

    #[test]
    fn needs_js_rendering_flags_spa_skeleton() {
        let html = format!(