mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-08 22:25:12 +02:00
Two targeted fixes surfaced by the manual extractor smoke test.
cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string
"Verifying your connection..." and an `interstitial-spinner` div.
That pattern was not in our detector, so local fetch returned the
challenge page, JSON-LD parsing found nothing, and the extractor
emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern plus a <10KB size gate so real articles that
happen to mention the phrase aren't misclassified. Two new tests
cover positive + negative cases.
- With the fix, trustpilot_reviews now correctly escalates via
smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY"
actionable error without a key, or cloud-bypassed HTML with one.
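
A minimal sketch of the phrase-plus-size-gate check described above, not the actual `cloud::is_bot_protected` code. The function name, the constant, and the choice to accept either marker on its own are assumptions made for illustration; only the two marker strings and the 10KB limit come from this message.

```rust
/// Assumed size gate: the challenge page is tiny (~565 bytes), real articles are not.
const INTERSTITIAL_MAX_BYTES: usize = 10 * 1024;

/// Hypothetical helper: returns true when `html` looks like the AWS WAF
/// interstitial. The real detector may combine these signals differently.
pub fn looks_like_waf_interstitial(html: &str) -> bool {
    // Apply the size gate first so a long article that merely quotes the
    // phrase is never classified as a challenge page.
    if html.len() >= INTERSTITIAL_MAX_BYTES {
        return false;
    }
    // Assumption: either marker is enough once the page is known to be small.
    html.contains("Verifying your connection...") || html.contains("interstitial-spinner")
}
```

Checking the size before the string match keeps the negative case cheap and makes the intent explicit: the phrase alone is not treated as evidence of a challenge page.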
ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and
produced an empty `offers` list when JSON-LD was present but its
`offers` node was missing or empty. Many sites (Patagonia-style
catalog pages, smaller Squarespace stores) ship price data in either
OG tags or JSON-LD, but not in both.
- Added OG meta-tag fallback that handles:
* no JSON-LD at all -> build minimal payload from og:title,
og:image, og:description, product:price:amount,
product:price:currency, product:availability, product:brand
* JSON-LD present but offers empty -> augment with an OG-derived
offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback"
so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag
so blog posts don't get mis-classified as products.
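
The branch selection above can be sketched roughly as follows. This is an illustration under stated assumptions, not the extractor's code: the signatures, the `HashMap` of pre-parsed meta tags, and the `data_source` helper are hypothetical, and the real `has_og_product_signal()` may accept other price tags or take raw HTML.

```rust
use std::collections::HashMap;

/// Guess at the shape of the product signal check: og:type=product or a
/// price tag. Assumes meta tags were already parsed into a property -> content map.
pub fn has_og_product_signal(meta: &HashMap<String, String>) -> bool {
    let is_product_type = meta
        .get("og:type")
        .map(|t| t.eq_ignore_ascii_case("product"))
        .unwrap_or(false);
    // Assumption: `product:price:amount` is the price tag that counts.
    is_product_type || meta.contains_key("product:price:amount")
}

/// Hypothetical helper that picks the `data_source` label for a page.
/// Returns None when neither JSON-LD nor an OG product signal is available.
pub fn data_source(
    has_jsonld: bool,
    jsonld_has_offers: bool,
    has_og_signal: bool,
) -> Option<&'static str> {
    match (has_jsonld, jsonld_has_offers, has_og_signal) {
        // JSON-LD already carries offers: nothing to augment.
        (true, true, _) => Some("jsonld"),
        // JSON-LD without offers, OG can supply a price.
        (true, false, true) => Some("jsonld+og"),
        // JSON-LD without offers and no OG signal: keep what we have.
        (true, false, false) => Some("jsonld"),
        // No JSON-LD at all: build the payload from OG tags.
        (false, _, true) => Some("og_fallback"),
        // Nothing usable: the caller would report an error.
        (false, _, false) => None,
    }
}
```

A blog post with `og:type=article` and no price tag fails the signal check, so it never reaches the "og_fallback" branch.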
Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
| Name |
|---|
| src |
| tests |
| Cargo.toml |