Commit graph

2 commits

Author SHA1 Message Date
Valerio
7f5eb93b65 feat(extractors): wave 6b, etsy_listing + HTML fallbacks for substack/youtube
Adds etsy_listing and hardens two existing extractors with HTML fallbacks
so transient API failures still return useful data.

New:
- etsy_listing: /listing/{id}(/slug) with Schema.org Product JSON-LD +
  OG fallback. Antibot-gated, routes through cloud::smart_fetch_html
  like amazon_product and ebay_listing. Auto-dispatched (etsy host is
  unique).

Hardened:
- substack_post: when /api/v1/posts/{slug} returns non-200 (rate limit,
  403 on hardened custom domains, 5xx), fall back to HTML fetch and
  parse OG tags + Article JSON-LD. Response shape is stable across
  both paths, with a `data_source` field of "api" or "html_fallback".
- youtube_video: when ytInitialPlayerResponse is missing (EU-consent
  interstitial, age-gated, some live pre-shows), fall back to OG tags
  for title/description/thumbnail. `data_source` now "player_response"
  or "og_fallback".

Tests: 91 passing in webclaw-fetch (9 new), clippy clean.
2026-04-22 16:44:51 +02:00
Valerio
8cc727c2f2 feat(extractors): wave 6a, 5 easy verticals (27 total)
Adds 5 structured extractors that hit public APIs with stable shapes:

- github_issue: /repos/{o}/{r}/issues/{n} (rejects PRs, points to github_pr)
- shopify_collection: /collections/{handle}.json + products.json
- woocommerce_product: /wp-json/wc/store/v1/products?slug={slug}
- substack_post: /api/v1/posts/{slug} (works on custom domains too)
- youtube_video: ytInitialPlayerResponse blob from /watch HTML

Auto-dispatched: github_issue, youtube_video (unique hosts and stable
URL shapes). Explicit-call: shopify_collection, woocommerce_product,
substack_post (URL shapes overlap with non-target sites).

Tests: 82 total passing in webclaw-fetch (12 new), clippy clean.
2026-04-22 16:33:35 +02:00