Commit graph

10 commits

Author SHA1 Message Date
Valerio
058493bc8f feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend
Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now
take `client: &dyn Fetcher` instead of `client: &FetchClient` directly.
Backwards-compatible: FetchClient implements Fetcher, blanket impls
cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server
callers keep working unchanged.

Motivation: the production API server (api.webclaw.io) must not do
in-process TLS fingerprinting; it delegates all HTTP to the Go
tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on
production would have required importing wreq into the server's
dep graph, violating the CLAUDE.md rule. Now production can provide
its own TlsSidecarFetcher implementation and pass it to the same
dispatcher the OSS server uses.

Changes:
- New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus
  blanket impls for `&T` and `Arc<T>`.
- `FetchClient` gains a tiny impl block in client.rs that forwards to
  its existing public methods.
- All 28 extractor signatures migrated from `&FetchClient` to
  `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change).
- `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`.
- `extractors::dispatch_by_url` and `extractors::dispatch_by_name`
  take `&dyn Fetcher`.
- `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has
  native async-fn-in-trait but dyn dispatch still needs async_trait).
- Version bumped to 0.5.1, CHANGELOG updated.

Tests: 215 passing in webclaw-fetch (no new tests needed — the existing
extractor tests exercise the trait methods transparently).
Clippy: clean workspace-wide.
2026-04-22 21:17:50 +02:00
Valerio
b2e7dbf365 fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)
Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews — full rewrite for 2025 schema:
- Trustpilot moved from single-Organization+aggregateRating to three
  separate JSON-LD blocks: a site-level Organization (Trustpilot
  itself), a Dataset with a csvw:Table mainEntity carrying the
  per-star distribution for the target business, and an aiSummary +
  aiSummaryReviews block with the AI-generated summary and recent
  review objects.
- Parser now: skips the site-level Org, walks @graph as either array
  or single object, picks the Dataset whose about.@id references the
  target domain, parses each csvw:column for rating buckets, computes
  weighted-average rating + total from the distribution, extracts the
  aiSummary text, and turns aiSummaryReviews into a clean reviews
  array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
  average_rating when the Dataset block is absent. OG-description
  regex for review_count.
- Returned shape: url, domain, business_name, rating_label,
  average_rating, review_count, rating_distribution (per-star count
  and percent), ai_summary, recent_reviews, review_count_listed,
  data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 /
  226 reviews with full distribution + AI summary + 2 recent reviews.

amazon_product — force-cloud-escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA
  pages. When local fetch returns HTML without Product JSON-LD and
  a cloud client is configured, force-escalate to the cloud path
  which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the
  cloud's synthesize_html output (OG tags only, no #productTitle DOM
  ID) still yields useful data. Real Amazon pages still prefer the
  DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple
  MacBook Pro title + description + asin.

etsy_listing — slug humanization + generic-page filtering + shop
from brand:
- Etsy serves various placeholder pages when a listing is delisted,
  blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
  "This item is unavailable - Etsy", plus the OG description
  "Sorry, the page you were looking for was not found." is_generic_*
  helpers catch all three shapes.
- When the OG title is generic, humanise the URL slug: the path
  `/listing/123456789/personalized-stainless-steel-tumbler` becomes
  `Personalized Stainless Steel Tumbler` so callers always get a
  meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on
  listings that don't ship offers[].seller.name. Shop now falls
  through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data
  (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating /
  225 reviews, InStock).

cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does
  not return raw HTML by design and that [`synthesize_html`]
  reassembles the parsed response (metadata + structured_data +
  markdown) back into minimal HTML so existing local parsers run
  unchanged across both paths. Also notes the DOM-regex limitation
  for extractors that need live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean.
Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY:
28/28 clean, 0 partial, 0 failed.
2026-04-22 17:49:50 +02:00
Valerio
a53578e45c fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product
Two targeted fixes surfaced by the manual extractor smoke test.

cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string
  "Verifying your connection..." and an `interstitial-spinner` div.
  That pattern was not in our detector, so local fetch returned the
  challenge page, JSON-LD parsing found nothing, and the extractor
  emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern plus a <10KB size gate so real articles that
  happen to mention the phrase aren't misclassified. Two new tests
  cover positive + negative cases.
- With the fix, trustpilot_reviews now correctly escalates via
  smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY"
  actionable error without a key, or cloud-bypassed HTML with one.

ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and
  produced an empty `offers` list when JSON-LD was present but its
  `offers` node was. Many sites (Patagonia-style catalog pages,
  smaller Squarespace stores) ship one or the other of OG / JSON-LD
  but not both with price data.
- Added OG meta-tag fallback that handles:
  * no JSON-LD at all -> build minimal payload from og:title,
    og:image, og:description, product:price:amount,
    product:price:currency, product:availability, product:brand
  * JSON-LD present but offers empty -> augment with an OG-derived
    offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback"
  so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag
  so blog posts don't get mis-classified as products.

Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
2026-04-22 17:07:31 +02:00
Valerio
7f5eb93b65 feat(extractors): wave 6b, etsy_listing + HTML fallbacks for substack/youtube
Adds etsy_listing and hardens two existing extractors with HTML fallbacks
so transient API failures still return useful data.

New:
- etsy_listing: /listing/{id}(/slug) with Schema.org Product JSON-LD +
  OG fallback. Antibot-gated, routes through cloud::smart_fetch_html
  like amazon_product and ebay_listing. Auto-dispatched (etsy host is
  unique).

Hardened:
- substack_post: when /api/v1/posts/{slug} returns non-200 (rate limit,
  403 on hardened custom domains, 5xx), fall back to HTML fetch and
  parse OG tags + Article JSON-LD. Response shape is stable across
  both paths, with a `data_source` field of "api" or "html_fallback".
- youtube_video: when ytInitialPlayerResponse is missing (EU-consent
  interstitial, age-gated, some live pre-shows), fall back to OG tags
  for title/description/thumbnail. `data_source` now "player_response"
  or "og_fallback".

Tests: 91 passing in webclaw-fetch (9 new), clippy clean.
2026-04-22 16:44:51 +02:00
Valerio
8cc727c2f2 feat(extractors): wave 6a, 5 easy verticals (27 total)
Adds 5 structured extractors that hit public APIs with stable shapes:

- github_issue: /repos/{o}/{r}/issues/{n} (rejects PRs, points to github_pr)
- shopify_collection: /collections/{handle}.json + products.json
- woocommerce_product: /wp-json/wc/store/v1/products?slug={slug}
- substack_post: /api/v1/posts/{slug} (works on custom domains too)
- youtube_video: ytInitialPlayerResponse blob from /watch HTML

Auto-dispatched: github_issue, youtube_video (unique hosts and stable
URL shapes). Explicit-call: shopify_collection, woocommerce_product,
substack_post (URL shapes overlap with non-target sites).

Tests: 82 total passing in webclaw-fetch (12 new), clippy clean.
2026-04-22 16:33:35 +02:00
Valerio
d8c9274a9c feat(extractors): wave 5 \u2014 Amazon, eBay, Trustpilot via cloud fallback
Three hard-site extractors that all require antibot bypass to ever
return usable data. They ship in OSS so the parsers + schema live
with the rest of the vertical extractors, but the fetch path routes
through cloud::smart_fetch_html \u2014 meaning:

- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or
  WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on
  challenge-page detection we escalate to api.webclaw.io/v1/scrape
  with formats=['html'] and parse the antibot-bypassed HTML locally.

- Without a cloud key, callers get a typed CloudError::NotConfigured
  whose Display message points at https://webclaw.io/signup.
  Self-hosters without a webclaw.io account know exactly what to do.

## New extractors (all auto-dispatched \u2014 unique hosts)

- amazon_product: ASIN extraction from /dp/, /gp/product/,
  /product/, /exec/obidos/ASIN/ URL shapes across every amazon.*
  locale. Parses the Product JSON-LD Amazon ships for SEO; falls
  back to #productTitle and #landingImage DOM selectors when
  JSON-LD is absent. Returns price, currency, availability,
  condition, brand, image, aggregate rating, SKU / MPN.

- ebay_listing: item-id extraction from /itm/{id} and
  /itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr /
  .it. Parses both bare Offer (Buy It Now) and AggregateOffer
  (used-copies / auctions) from the Product JSON-LD. Returns
  price or low/high-price range, currency, condition, seller,
  offer_count, aggregate rating.

- trustpilot_reviews: reactivated from the `trustpilot_reviews`
  file that was previously dead-code'd. Parser already worked; it
  just needed the smart_fetch_html path to get past AWS WAF's
  'Verifying Connection' interstitial. Organisation / LocalBusiness
  JSON-LD block gives aggregate rating + up to 20 recent reviews.

## FetchClient change

- Added optional `cloud: Option<Arc<CloudClient>>` field with
  `FetchClient::with_cloud(cloud) -> Self` builder + `cloud(&self)`
  accessor. Extractors call client.cloud() to decide whether they
  can escalate. Cheap clones (Arc-wrapped).

## webclaw-server wiring

AppState::new() now reads the cloud credential from env:

1. WEBCLAW_CLOUD_API_KEY \u2014 preferred, disambiguates from the
   server's own inbound bearer token.
2. WEBCLAW_API_KEY \u2014 fallback only when the server is in open
   mode (no inbound-auth key set), matching the MCP / CLI
   convention of that env var.

When present, state.rs builds a CloudClient and attaches it to the
FetchClient via with_cloud(). Log line at startup so operators see
when cloud fallback is active.

## Catalog + dispatch

All three extractors registered in list() and in dispatch_by_url.
/v1/extractors catalog now exposes 22 verticals. Explicit
/v1/scrape/{vertical} routes work per the existing pattern.

## Tests

- 7 new unit tests (parse_asin multi-shape + parse from JSON-LD
  fixture + DOM-fallback on missing JSON-LD for Amazon; ebay
  URL-matching + slugged-URL parsing + both Offer and AggregateOffer
  fixtures).
- Full extractors suite: 68 passing (was 59, +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI \u2014 verifying them
  means having WEBCLAW_CLOUD_API_KEY set against a real cloud
  backend. Integration-test harness is a separate follow-up.

Catalog summary: 22 verticals total across wave 1-5. Hard-site
three are gated behind an actionable cloud-fallback upgrade path
rather than silently returning nothing or 403-ing the caller.
2026-04-22 16:16:11 +02:00
Valerio
0221c151dc feat(extractors): wave 4 \u2014 ecommerce (shopify + generic JSON-LD)
Two ecommerce extractors covering the long tail of online stores:

- shopify_product: hits the public /products/{handle}.json endpoint
  that every Shopify store exposes. Undocumented but stable for 10+
  years. Returns title, vendor, product_type, tags, full variants
  array (price, SKU, stock, options), images, options matrix, and
  the price_min/price_max/any_available summary fields. Covers the
  ~4M Shopify stores out there, modulo stores that put Cloudflare
  in front of the shop. Rejects known non-Shopify hosts (amazon,
  etsy, walmart, etc.) to save a failed request.

- ecommerce_product: generic Schema.org Product JSON-LD extractor.
  Works on any modern store that ships the Google-required Product
  rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace,
  Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN,
  images, normalized offers (Offer and AggregateOffer flattened into
  one shape with price, currency, availability, condition),
  aggregateRating, and the raw JSON-LD block for anyone who wants it.
  Reuses webclaw_core::structured_data::extract_json_ld so the
  JSON-LD parser stays shared across the extraction pipeline.

Both are explicit-call only — /v1/scrape/shopify_product and
/v1/scrape/ecommerce_product. Not in auto-dispatch because any
arbitrary /products/{slug} URL could belong to either platform
(or to a custom site that uses the same path shape), and claiming
such URLs blindly would steal from the default markdown /v1/scrape
flow.

Live test results against real stores:
- Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images,
  Size option, all SKUs. 250ms.
- ecommerce_product / same Allbirds URL: ProductGroup schema, name
  'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer.
  300ms. Different extraction path, same product.
- ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand,
  200ms.
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as
  expected \u2014 the error message points callers at the ecommerce_product
  fallback, but Cloudflare also blocks the HTML path so those stores
  are cloud-tier territory.

Catalog now exposes 19 extractors via GET /v1/extractors. Unit
tests: 59 passing across the module.

Scope not in v1:
- trustpilot_reviews: file written and tested (JSON-LD walker), but
  NOT registered in the catalog or dispatch. Trustpilot's Cloudflare
  turnstile blocks our Firefox + Chrome + Safari + mobile profiles
  at the TLS layer. Shipping it would return 403 more often than 200.
  Code kept in-tree under #[allow(dead_code)] for when the cloud
  tier has residential-proxy support.
- Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF
  story. Not fixable without real browser + proxy pool.
- WooCommerce explicit: most WooCommerce stores ship Product JSON-LD,
  so ecommerce_product covers them. A dedicated WooCommerce REST
  extractor (/wp-json/wc/store/products) would be marginal on top of
  that and only works on ~30% of stores that expose the REST API.

Wave 4 positioning: we now own the OSS structured-scrape space for
any site that respects Schema.org. That's Google's entire rich-result
index \u2014 meaningful territory competitors won't try to replicate as
named endpoints.
2026-04-22 15:36:01 +02:00
Valerio
3bb0a4bca0 feat(extractors): add LinkedIn + Instagram with profile-to-posts fan-out
3 social-network extractors that work entirely without auth, using
public embed/preview endpoints + Instagram's own SEO-facing API:

- linkedin_post:      /embed/feed/update/{urn} returns full body,
                      author, image, OG tags. Accepts both the urn:li:share
                      and urn:li:activity URN forms plus the pretty
                      /posts/{slug}-{id}-{suffix} URLs.

- instagram_post:     /p/{shortcode}/embed/captioned/ returns the full
                      caption, username, thumbnail. Same endpoint serves
                      reels and IGTV, kind correctly classified.

- instagram_profile:  /api/v1/users/web_profile_info/?username=X with the
                      x-ig-app-id header (Instagram's public web-app id,
                      sent by their own JS bundle). Returns the full
                      profile + the 12 most recent posts with shortcodes,
                      kinds, like/comment counts, thumbnails, and caption
                      previews. Falls back to OG-tag scraping of the
                      public HTML if the API ever 401/403s.

The IG profile output is shaped so callers can fan out cleanly:
  for p in profile.recent_posts:
      scrape('instagram_post', p.url)
giving you 'whole profile + every recent post' in one loop. End-to-end
tested against ticketswave: 1 profile call + 12 post calls in ~3.5s.
Pagination beyond 12 posts requires authenticated cookies and is left
for the cloud where we can stash a session.

Infrastructure change: added FetchClient::fetch_with_headers so
extractors can satisfy site-specific request headers (here x-ig-app-id;
later github_pr will use this for Authorization, etc.) without polluting
the global FetchConfig.headers map. Same retry semantics as fetch().

Catalog now exposes 17 extractors via /v1/extractors. Total unit tests
across the module: 47 passing. Clippy clean. Fmt clean.

Live test on the maintainer's example URLs:
- LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body
  / shipper.club link / CDN image extracted in 250ms.
- Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave
  username, thumbnail. 200ms.
- Instagram profile (ticketswave): 18,473 followers (exact, not
  rounded), is_verified=True, is_business=True, biography with emojis,
  12 recent posts with shortcodes + kinds + likes. 400ms.

Out of scope for this wave (require infra we don't have):
- linkedin_profile: returns 999 to all bot UAs, needs OAuth
- facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome
- facebook_profile (personal): not publicly accessible by design
2026-04-22 14:39:49 +02:00
Valerio
b041f3cddd feat(extractors): wave 2 \u2014 8 more verticals (14 total)
Adds 8 more vertical extractors using public JSON APIs. All hit
deterministic endpoints with no antibot risk. Live tests pass
against canonical URLs for each.

AI / ML ecosystem (3):
- crates_io          \u2192 crates.io/api/v1/crates/{name}
- huggingface_dataset \u2192 huggingface.co/api/datasets/{path} (handles both
                       legacy /datasets/{name} and canonical {owner}/{name})
- arxiv              \u2192 export.arxiv.org/api/query (Atom XML parsed by quick-xml)

Code / version control (2):
- github_pr      \u2192 api.github.com/repos/{owner}/{repo}/pulls/{number}
- github_release \u2192 api.github.com/repos/{owner}/{repo}/releases/tags/{tag}

Infrastructure (1):
- docker_hub \u2192 hub.docker.com/v2/repositories/{namespace}/{name}
              (official-image shorthand /_/nginx normalized to library/nginx)

Community / publishing (2):
- dev_to        \u2192 dev.to/api/articles/{username}/{slug}
- stackoverflow \u2192 api.stackexchange.com/2.3/questions/{id} + answers,
                  filter=withbody for rendered HTML, sort=votes for
                  consistent top-answers ordering

Live test results (real URLs):
- serde:                 942M downloads, 838B response
- 'Attention Is All You Need': abstract + authors, 1.8KB
- nginx official:        12.9B pulls, 21k stars, 17KB
- openai/gsm8k:          822k downloads, 1.7KB
- rust-lang/rust#138000: merged by RalfJung, +3/-2, 1KB
- webclaw v0.4.0:        2.4KB
- a real dev.to article: 2.2KB body, 3.1KB total
- python yield Q&A:      score 13133, 51 answers, 104KB

Catalog now exposes 14 extractors via GET /v1/extractors. Total
unit tests across the module: 34 passing. Clippy clean. Fmt clean.

Marketing positioning sharpens: 14 dedicated extractors, all
deterministic, all 1-credit-per-call. Firecrawl's /extract is
5 credits per call and you write the schema yourself.
2026-04-22 14:20:21 +02:00
Valerio
8ba7538c37 feat(extractors): add vertical extractors module + first 6 verticals
New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:
- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog

First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):

- reddit       → www.reddit.com/*/.json
- hackernews   → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo  → api.github.com/repos/{owner}/{repo}
- pypi         → pypi.org/pypi/{name}/json
- npm          → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}

Server-side routes added:
- POST /v1/scrape/{vertical}  explicit per-vertical extraction
- GET  /v1/extractors         catalog (name, label, description, url_patterns)

The dispatcher validates that URL matches the requested vertical
before running, so users get "URL doesn't match the X extractor"
instead of opaque parse failures inside the extractor.

17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github
863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).

Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic
extractors per site.
2026-04-22 14:11:43 +02:00