mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-07 22:15:12 +02:00
fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product
Two targeted fixes surfaced by the manual extractor smoke test.
cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string
"Verifying your connection..." and an `interstitial-spinner` div.
That pattern was not in our detector, so local fetch returned the
challenge page, JSON-LD parsing found nothing, and the extractor
emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern plus a <10KB size gate so real articles that
happen to mention the phrase aren't misclassified. Two new tests
cover positive + negative cases.
- With the fix, trustpilot_reviews now correctly escalates via
smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY"
actionable error without a key, or cloud-bypassed HTML with one.
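The escalation path described above can be sketched roughly as follows. This is a minimal illustration, not the code in webclaw-fetch: `FetchPlan`, `looks_like_waf_interstitial`, and `plan_fetch` are hypothetical names, and the real `smart_fetch_html` signature is not shown in this commit.

```rust
/// Hypothetical outcome of a local fetch once bot protection has been checked.
#[derive(Debug, PartialEq)]
enum FetchPlan {
    /// The local HTML looks like real content; parse it directly.
    UseLocal,
    /// A challenge page was served and an API key is available.
    EscalateToCloud,
    /// A challenge page was served and no key is configured.
    NeedsApiKey(&'static str),
}

/// Simplified stand-in for the AWS WAF branch of the detector: both markers
/// must be present and the page must be under 10,000 bytes.
fn looks_like_waf_interstitial(html: &str) -> bool {
    let lower = html.to_lowercase();
    lower.contains("interstitial-spinner")
        && lower.contains("verifying your connection")
        && html.len() < 10_000
}

/// Decide what to do with locally fetched HTML (assumed control flow).
fn plan_fetch(html: &str, api_key: Option<&str>) -> FetchPlan {
    if !looks_like_waf_interstitial(html) {
        return FetchPlan::UseLocal;
    }
    match api_key {
        Some(key) if !key.is_empty() => FetchPlan::EscalateToCloud,
        _ => FetchPlan::NeedsApiKey("Set WEBCLAW_API_KEY"),
    }
}
```

Without the detector match, the challenge page would fall through to `UseLocal` and reach JSON-LD parsing, which is the source of the confusing error described above.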
ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and
produced an empty `offers` list when JSON-LD was present but its
`offers` node was missing or empty. Many sites (Patagonia-style
catalog pages, smaller Squarespace stores) ship price data in OG
tags or in JSON-LD, but not in both.
- Added OG meta-tag fallback that handles:
* no JSON-LD at all -> build minimal payload from og:title,
og:image, og:description, product:price:amount,
product:price:currency, product:availability, product:brand
* JSON-LD present but offers empty -> augment with an OG-derived
offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback"
so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag
so blog posts don't get mis-classified as products.
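The branch selection above could look something like the sketch below. It is an assumption-laden illustration: the `OgTags` map, the `Offer` struct, `og_offer`, and `resolve` are hypothetical, and only the `data_source` labels and the rule behind `has_og_product_signal` come from this commit message.

```rust
use std::collections::HashMap;

/// Hypothetical flattened view of a page's <meta property="..."> tags.
type OgTags = HashMap<String, String>;

/// Hypothetical minimal offer; the real payload shape is not shown here.
#[derive(Debug, PartialEq)]
struct Offer {
    price: String,
    currency: Option<String>,
}

/// True when the tags carry a product signal: og:type=product or a price tag.
fn has_og_product_signal(og: &OgTags) -> bool {
    og.get("og:type").map(|t| t == "product").unwrap_or(false)
        || og.contains_key("product:price:amount")
}

/// Build an offer from the OG price tags, if a price is present.
fn og_offer(og: &OgTags) -> Option<Offer> {
    og.get("product:price:amount").map(|price| Offer {
        price: price.clone(),
        currency: og.get("product:price:currency").cloned(),
    })
}

/// Pick the `data_source` label and the offers to emit.
/// `jsonld_offers` is `None` when the page has no Product JSON-LD at all.
fn resolve(
    jsonld_offers: Option<Vec<Offer>>,
    og: &OgTags,
) -> Option<(&'static str, Vec<Offer>)> {
    match jsonld_offers {
        // JSON-LD carries its own offers: use it as-is.
        Some(offers) if !offers.is_empty() => Some(("jsonld", offers)),
        // JSON-LD present but offers empty: augment from OG when possible.
        Some(_) => match og_offer(og) {
            Some(offer) => Some(("jsonld+og", vec![offer])),
            None => Some(("jsonld", Vec::new())),
        },
        // No JSON-LD: fall back to OG only when the page looks like a product.
        None if has_og_product_signal(og) => {
            Some(("og_fallback", og_offer(og).into_iter().collect()))
        }
        // Neither source applies (for example, a blog post).
        None => None,
    }
}
```

Returning the label alongside the offers keeps the provenance decision in one place, so a caller never has to infer which branch ran from the shape of the data.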
Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
parent 7f5eb93b65
commit a53578e45c
2 changed files with 299 additions and 28 deletions
@@ -325,6 +325,18 @@ pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
         return true;
     }
 
+    // AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
+    // Distinct from the captcha-branded path above: the challenge page is
+    // a tiny HTML shell with an `interstitial-spinner` div and no content.
+    // Gating on html.len() keeps false-positives off long pages that
+    // happen to mention the phrase in an unrelated context.
+    if html_lower.contains("interstitial-spinner")
+        && html_lower.contains("verifying your connection")
+        && html.len() < 10_000
+    {
+        return true;
+    }
+
     // hCaptcha *blocking* page (not just an embedded widget).
     if html_lower.contains("hcaptcha.com")
         && html_lower.contains("h-captcha")
@@ -564,6 +576,26 @@ mod tests {
         assert!(!is_bot_protected(&html, &empty_headers()));
     }
 
+    #[test]
+    fn is_bot_protected_detects_aws_waf_verifying_connection() {
+        // The exact shape Trustpilot serves under AWS WAF.
+        let html = r#"<div class="container"><div id="loading-state">
+            <div class="interstitial-spinner" id="spinner"></div>
+            <h1>Verifying your connection...</h1></div></div>"#;
+        assert!(is_bot_protected(html, &empty_headers()));
+    }
+
+    #[test]
+    fn is_bot_protected_ignores_phrase_on_real_content() {
+        // A real article that happens to mention the phrase in prose
+        // should not trigger the short-page detector.
+        let html = format!(
+            "<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
+            "article text ".repeat(2_000)
+        );
+        assert!(!is_bot_protected(&html, &empty_headers()));
+    }
+
     #[test]
     fn needs_js_rendering_flags_spa_skeleton() {
         let html = format!(