Compare commits

...

25 commits
v0.4.0 ... main

Author SHA1 Message Date
Valerio
a5c3433372 fix(core+server): guard markdown pipe slice + detect trustpilot/reddit verify walls
2026-04-23 15:26:31 +02:00
Valerio
966981bc42 fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block
2026-04-23 15:17:04 +02:00
Valerio
866fa88aa0 fix(fetch): reject HTML verification pages served at .json reddit URL 2026-04-23 15:06:35 +02:00
Valerio
b413d702b2 feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6 2026-04-23 14:59:29 +02:00
Valerio
98a177dec4 feat(cli): expose safari-ios browser profile + bump to 0.5.5 2026-04-23 13:32:55 +02:00
Valerio
e1af2da509 docs(claude): drop sidecar references, mention ProductionFetcher 2026-04-23 13:25:23 +02:00
Valerio
2285c585b1 docs(changelog): simplify 0.5.4 entry 2026-04-23 13:01:02 +02:00
Valerio
b77767814a Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper
- New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26.
  Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS
  extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers,
  gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3
  8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on
  immobiliare.it with country-it residential.

- BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped
  MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256,
  explicit extension_permutation, advertise h3 in ALPN and ALPS.
  JA3 43067709b025da334de1279a120f8e14, akamai_fp
  52d84b11737d980aef856699f885ca86. Fixes indeed.com and other
  Cloudflare-fronted sites.

- New locale module: accept_language_for_url / accept_language_for_tld.
  TLD to Accept-Language mapping, unknown TLDs default to en-US.
  DataDome geo-vs-locale cross-checks are now trivially satisfiable.

- wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.
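The locale helper described above could be sketched roughly like this (the function names `accept_language_for_tld` / `accept_language_for_url` come from the commit message; the mapping values and the URL parsing are assumptions):

```rust
// Hypothetical sketch of the locale module: map a URL's public TLD to a
// plausible Accept-Language header. Mapping values are illustrative.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9,en;q=0.8",
        "de" => "de-DE,de;q=0.9,en;q=0.8",
        "fr" => "fr-FR,fr;q=0.9,en;q=0.8",
        // Unknown TLDs default to en-US, as the commit notes.
        _ => "en-US,en;q=0.9",
    }
}

fn accept_language_for_url(url: &str) -> &'static str {
    // Take the label after the last dot of the host portion.
    let host = url
        .trim_start_matches("https://")
        .trim_start_matches("http://")
        .split('/')
        .next()
        .unwrap_or("");
    let tld = host.rsplit('.').next().unwrap_or("");
    accept_language_for_tld(tld)
}

fn main() {
    assert_eq!(
        accept_language_for_url("https://www.immobiliare.it/annunci"),
        "it-IT,it;q=0.9,en;q=0.8"
    );
    assert_eq!(accept_language_for_url("https://example.com/page"), "en-US,en;q=0.9");
    println!("ok");
}
```

A real table would cover far more TLDs; the point is only that the geo-vs-locale cross-check becomes a pure lookup.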
2026-04-23 12:58:24 +02:00
Valerio
4bf11d902f fix(mcp): vertical_scrape uses Firefox profile, not default Chrome
Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a
403 even from residential IPs. Their block list includes known
browser-emulation library fingerprints. wreq-Firefox passes. The
CLI `vertical` subcommand already forced Firefox; MCP
`vertical_scrape` was still falling back to the long-lived
`self.fetch_client` which defaults to Chrome, so reddit failed
on MCP and nobody noticed because the earlier test runs all had
an API key set that masked the issue.

Switched vertical_scrape to reuse `self.firefox_or_build()` which
gives us the cached Firefox client (same pattern the scrape tool
uses when the caller requests `browser: firefox`). Firefox is
strictly safer than Chrome for every vertical in the catalog, so
making it the hard default for `vertical_scrape` is the right call.
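The cached-client pattern described above can be sketched with `std::sync::OnceLock` (all type and field names here are illustrative stand-ins, not the real webclaw types):

```rust
use std::sync::{Arc, OnceLock};

// Stand-in for the real FetchClient; the profile field is illustrative.
struct FetchClient {
    profile: &'static str,
}

struct McpServer {
    // Built once on first use, then shared by every vertical_scrape call.
    firefox: OnceLock<Arc<FetchClient>>,
}

impl McpServer {
    fn new() -> Self {
        Self { firefox: OnceLock::new() }
    }

    // Sketch of firefox_or_build(): return the cached Firefox client,
    // constructing it on the first call only.
    fn firefox_or_build(&self) -> Arc<FetchClient> {
        self.firefox
            .get_or_init(|| Arc::new(FetchClient { profile: "firefox" }))
            .clone()
    }
}

fn main() {
    let server = McpServer::new();
    let a = server.firefox_or_build();
    let b = server.firefox_or_build();
    assert!(Arc::ptr_eq(&a, &b)); // same cached instance both times
    assert_eq!(a.profile, "firefox");
    println!("ok");
}
```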

Verified end-to-end from a clean shell with no WEBCLAW_API_KEY:
- MCP reddit: 679ms, post/author/6 comments correct
- MCP instagram_profile: 1157ms, 18471 followers

No change to the `scrape` tool -- it keeps the user-selectable
browser param.

Bumps version to 0.5.3.
2026-04-22 23:18:11 +02:00
Valerio
0daa2fec1a feat(cli+mcp): vertical extractor support (28 extractors discoverable + callable)
Wires the vertical extractor catalog into both the CLI and the MCP
server so users don't have to hit the HTTP API to invoke them. Same
semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.

CLI (webclaw-cli):
- New subcommand `webclaw extractors` lists all 28 extractors with
  name, label, and sample URL. `--json` flag emits the full catalog
  as machine-readable JSON.
- New subcommand `webclaw vertical <name> <url>` runs a specific
  extractor and prints typed JSON. Pretty-printed by default; `--raw`
  for single-line. Exits 1 with a clear "URL does not match" error
  on mismatch.
- FetchClient built with Firefox profile + cloud fallback attached
  when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate.

MCP (webclaw-mcp):
- New tool `list_extractors` (no args) returns the catalog as
  pretty-printed JSON for in-session discovery.
- New tool `vertical_scrape` takes `{name, url}` and returns typed
  JSON. Reuses the long-lived self.fetch_client.
- Tool count goes from 10 to 12. Server-info instruction string
  updated accordingly.

Tests: 215 passing, clippy clean. Manual surface-tested end-to-end:
CLI prints real Reddit/GitHub/PyPI data; MCP JSON-RPC session returns
28-entry catalog + typed responses for pypi/requests + rust-lang/rust
in 200-400ms.

Version bumped to 0.5.2 (minor for API additions, backwards compatible).
2026-04-22 21:41:15 +02:00
Valerio
058493bc8f feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend
Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now
take `client: &dyn Fetcher` instead of `client: &FetchClient` directly.
Backwards-compatible: FetchClient implements Fetcher, blanket impls
cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server
callers keep working unchanged.

Motivation: the production API server (api.webclaw.io) must not do
in-process TLS fingerprinting; it delegates all HTTP to the Go
tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on
production would have required importing wreq into the server's
dep graph, violating the CLAUDE.md rule. Now production can provide
its own TlsSidecarFetcher implementation and pass it to the same
dispatcher the OSS server uses.

Changes:
- New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus
  blanket impls for `&T` and `Arc<T>`.
- `FetchClient` gains a tiny impl block in client.rs that forwards to
  its existing public methods.
- All 28 extractor signatures migrated from `&FetchClient` to
  `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change).
- `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`.
- `extractors::dispatch_by_url` and `extractors::dispatch_by_name`
  take `&dyn Fetcher`.
- `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has
  native async-fn-in-trait but dyn dispatch still needs async_trait).
- Version bumped to 0.5.1, CHANGELOG updated.
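A minimal synchronous sketch of the trait-object shape this commit describes (the real `Fetcher` is async via `async-trait`; everything beyond the names `Fetcher` and `FetchClient` is illustrative):

```rust
use std::sync::Arc;

// Simplified sync version of the Fetcher abstraction. The real trait is
// async and returns typed responses; this sketch keeps only the shape.
trait Fetcher {
    fn fetch(&self, url: &str) -> String;
}

struct FetchClient;
impl Fetcher for FetchClient {
    fn fetch(&self, url: &str) -> String {
        format!("<html>fetched {url}</html>")
    }
}

// Blanket impls so existing callers holding &T or Arc<T> keep compiling.
impl<'a, T: Fetcher + ?Sized> Fetcher for &'a T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}
impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

// Extractors take &dyn Fetcher, so any backend plugs in.
fn extract(client: &dyn Fetcher, url: &str) -> String {
    client.fetch(url)
}

fn main() {
    let client = Arc::new(FetchClient);
    let out = extract(&client, "https://pypi.org/pypi/requests/json");
    assert_eq!(out, "<html>fetched https://pypi.org/pypi/requests/json</html>");
    println!("ok");
}
```

The blanket impls are what make the migration backwards-compatible: a production `TlsSidecarFetcher` only has to implement the one trait.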

Tests: 215 passing in webclaw-fetch (no new tests needed — the existing
extractor tests exercise the trait methods transparently).
Clippy: clean workspace-wide.
2026-04-22 21:17:50 +02:00
Valerio
aaa5103504 docs(claude): fix stale primp references, document wreq + Fetcher trait
webclaw-fetch switched from primp to wreq 6.x (BoringSSL) a while ago
but CLAUDE.md still documented primp, the `[patch.crates-io]`
requirement, and RUSTFLAGS that no longer apply. Refreshed four
sections:

- Crate listing: webclaw-fetch uses wreq, not primp
- client.rs description: wreq BoringSSL, plus a note that FetchClient
  will implement the new Fetcher trait so production can swap in a
  tls-sidecar-backed fetcher without importing wreq
- Hard Rules: dropped obsolete `[patch.crates-io]` and RUSTFLAGS lines,
  added the "Vertical extractors take `&dyn Fetcher`" rule that makes
  the architectural separation explicit for the upcoming production
  integration
- Removed language about primp being "patched"; reqwest in webclaw-llm
  is now just "plain reqwest" with no relationship to wreq
2026-04-22 21:11:18 +02:00
Valerio
2373162c81 chore: release v0.5.0 (28 vertical extractors + cloud integration)
See CHANGELOG.md for the full entry. Headline: 28 site-specific
extractors returning typed JSON, five with automatic antibot
cloud-escalation via api.webclaw.io, `POST /v1/scrape/{vertical}` +
`GET /v1/extractors` on webclaw-server.
2026-04-22 20:59:43 +02:00
Valerio
b2e7dbf365 fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs)
Addresses the four follow-ups surfaced by the cloud-key smoke test.

trustpilot_reviews — full rewrite for 2025 schema:
- Trustpilot moved from single-Organization+aggregateRating to three
  separate JSON-LD blocks: a site-level Organization (Trustpilot
  itself), a Dataset with a csvw:Table mainEntity carrying the
  per-star distribution for the target business, and an aiSummary +
  aiSummaryReviews block with the AI-generated summary and recent
  review objects.
- Parser now: skips the site-level Org, walks @graph as either array
  or single object, picks the Dataset whose about.@id references the
  target domain, parses each csvw:column for rating buckets, computes
  weighted-average rating + total from the distribution, extracts the
  aiSummary text, and turns aiSummaryReviews into a clean reviews
  array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
  average_rating when the Dataset block is absent. OG-description
  regex for review_count.
- Returned shape: url, domain, business_name, rating_label,
  average_rating, review_count, rating_distribution (per-star count
  and percent), ai_summary, recent_reviews, review_count_listed,
  data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 /
  226 reviews with full distribution + AI summary + 2 recent reviews.
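The weighted-average computation mentioned above amounts to sum(star * count) / sum(count); a small sketch, using a hypothetical distribution rather than real Trustpilot data:

```rust
// Illustrative sketch of the distribution math: derive the weighted
// average rating and total review count from per-star counts. The
// csvw:column parsing that produces the counts is omitted.
fn rating_from_distribution(counts_by_star: &[(u32, u64)]) -> (f64, u64) {
    let total: u64 = counts_by_star.iter().map(|(_, c)| c).sum();
    if total == 0 {
        return (0.0, 0);
    }
    let weighted: u64 = counts_by_star
        .iter()
        .map(|(star, c)| (*star as u64) * *c)
        .sum();
    // Round to one decimal, the precision Trustpilot displays.
    let avg = (weighted as f64 / total as f64 * 10.0).round() / 10.0;
    (avg, total)
}

fn main() {
    // Hypothetical distribution: heavily one-star, a few higher reviews.
    let dist = [(1, 180), (2, 10), (3, 6), (4, 10), (5, 20)];
    let (avg, total) = rating_from_distribution(&dist);
    assert_eq!(total, 226);
    println!("avg={avg} total={total}");
}
```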

amazon_product — force-cloud-escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA
  pages. When local fetch returns HTML without Product JSON-LD and
  a cloud client is configured, force-escalate to the cloud path
  which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the
  cloud's synthesize_html output (OG tags only, no #productTitle DOM
  ID) still yields useful data. Real Amazon pages still prefer the
  DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple
  MacBook Pro title + description + asin.

etsy_listing — slug humanization + generic-page filtering + shop
from brand:
- Etsy serves various placeholder pages when a listing is delisted,
  blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
  "This item is unavailable - Etsy", plus the OG description
  "Sorry, the page you were looking for was not found." is_generic_*
  helpers catch all three shapes.
- When the OG title is generic, humanize the URL slug: the path
  `/listing/123456789/personalized-stainless-steel-tumbler` becomes
  `Personalized Stainless Steel Tumbler` so callers always get a
  meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on
  listings that don't ship offers[].seller.name. Shop now falls
  through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data
  (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating /
  225 reviews, InStock).
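The slug humanization described above is straightforward to sketch (path shape from the commit; the helper name is illustrative):

```rust
// Sketch: turn /listing/{id}/{slug} into a title-cased human string.
fn humanize_listing_slug(path: &str) -> Option<String> {
    let mut parts = path.trim_matches('/').split('/');
    if parts.next()? != "listing" {
        return None;
    }
    let _id = parts.next()?; // numeric listing id, unused for the title
    let slug = parts.next()?;
    let title = slug
        .split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            // Upper-case the first character of each word.
            let mut chars = w.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect::<Vec<_>>()
        .join(" ");
    if title.is_empty() { None } else { Some(title) }
}

fn main() {
    assert_eq!(
        humanize_listing_slug("/listing/123456789/personalized-stainless-steel-tumbler")
            .as_deref(),
        Some("Personalized Stainless Steel Tumbler")
    );
    // No slug segment: nothing to humanize.
    assert_eq!(humanize_listing_slug("/listing/123456789"), None);
    println!("ok");
}
```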

cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does
  not return raw HTML by design and that [`synthesize_html`]
  reassembles the parsed response (metadata + structured_data +
  markdown) back into minimal HTML so existing local parsers run
  unchanged across both paths. Also notes the DOM-regex limitation
  for extractors that need live-page-specific DOM IDs.

Tests: 215 passing in webclaw-fetch (18 new), clippy clean.
Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY:
28/28 clean, 0 partial, 0 failed.
2026-04-22 17:49:50 +02:00
Valerio
e10066f527 fix(cloud): synthesize HTML from cloud response instead of requesting raw html
api.webclaw.io/v1/scrape does not return a `html` field even when
`formats=["html"]` is requested, by design: the cloud API returns
pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags,
title, description, image, site_name), and `markdown`.

Our CloudClient::fetch_html helper was premised on the API returning
raw HTML. Without a key set, the error message was hidden behind
CloudError::NotConfigured so the bug never surfaced. With a key set,
every extractor that escalated to cloud (trustpilot_reviews,
etsy_listing, amazon_product, ebay_listing, substack_post HTML
fallback) got back "cloud /v1/scrape returned no html field".

Fix: reassemble a minimal synthetic HTML document from the cloud's
parsed output. Each JSON-LD block goes back into a
`<script type="application/ld+json">` tag, metadata fields become OG
`<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing
local extractor parsers (find_product_jsonld, find_business,
og() regex) see the same shapes they'd see from a real page, so no
per-extractor changes needed.
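A minimal sketch of the reassembly idea (the signature is an assumption, and real code would also need to HTML-escape the metadata values):

```rust
// Rebuild a tiny HTML document from the cloud response's parsed parts so
// existing local parsers see familiar shapes: JSON-LD in <script> tags,
// metadata as OG <meta> tags, markdown in a <pre>. No escaping here.
fn synthesize_html(jsonld_blocks: &[&str], og: &[(&str, &str)], markdown: &str) -> String {
    let mut html = String::from("<html><head>");
    for (prop, content) in og {
        html.push_str(&format!(r#"<meta property="og:{prop}" content="{content}">"#));
    }
    html.push_str("</head><body>");
    for block in jsonld_blocks {
        html.push_str(&format!(
            r#"<script type="application/ld+json">{block}</script>"#
        ));
    }
    html.push_str(&format!("<pre>{markdown}</pre></body></html>"));
    html
}

fn main() {
    let html = synthesize_html(
        &[r#"{"@type":"Product","name":"Example"}"#],
        &[("title", "Example"), ("type", "product")],
        "# Example\n\nBody text.",
    );
    assert!(html.contains(r#"<script type="application/ld+json">"#));
    assert!(html.contains(r#"<meta property="og:title" content="Example">"#));
    println!("ok");
}
```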

Verified end-to-end with WEBCLAW_CLOUD_API_KEY set:
- trustpilot_reviews: escalates, returns Organization JSON-LD data
  (parser picks Trustpilot site-level Org not the reviewed business;
  tracked as a follow-up to update Trustpilot schema handling)
- etsy_listing: escalates via antibot render path; listing-specific
  data depends on target listing having JSON-LD (many Etsy listings
  don't)
- amazon_product, ebay_listing: stay local because their pages ship
  enough content not to trigger bot-detection escalation
- The other 24 extractors unchanged (local path, zero cloud credits)

Tests: 200 passing in webclaw-fetch (3 new), clippy clean.
2026-04-22 17:24:50 +02:00
Valerio
a53578e45c fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product
Two targeted fixes surfaced by the manual extractor smoke test.

cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string
  "Verifying your connection..." and an `interstitial-spinner` div.
  That pattern was not in our detector, so local fetch returned the
  challenge page, JSON-LD parsing found nothing, and the extractor
  emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern plus a <10KB size gate so real articles that
  happen to mention the phrase aren't misclassified. Two new tests
  cover positive + negative cases.
- With the fix, trustpilot_reviews now correctly escalates via
  smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY"
  actionable error without a key, or cloud-bypassed HTML with one.
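The detector rule above reads as a size gate plus two substring checks; a sketch with an assumed helper name:

```rust
// Flag the AWS WAF interstitial only when the telltale strings appear in
// a small response, so long articles that merely mention the phrase are
// not misclassified. Threshold and patterns follow the commit message.
fn is_waf_interstitial(html: &str) -> bool {
    const SIZE_GATE: usize = 10 * 1024; // only tiny challenge pages qualify
    html.len() < SIZE_GATE
        && html.contains("Verifying your connection")
        && html.contains("interstitial-spinner")
}

fn main() {
    let challenge = r#"<html><body><div class="interstitial-spinner"></div>Verifying your connection...</body></html>"#;
    assert!(is_waf_interstitial(challenge));

    // A large real article mentioning the phrase must not match.
    let mut article =
        String::from("Verifying your connection is what WAFs say. interstitial-spinner ");
    article.push_str(&"x".repeat(20 * 1024));
    assert!(!is_waf_interstitial(&article));
    println!("ok");
}
```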

ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and
  produced an empty `offers` list when JSON-LD was present but its
  `offers` node was empty. Many sites (Patagonia-style catalog pages,
  smaller Squarespace stores) ship one or the other of OG / JSON-LD
  but not both with price data.
- Added OG meta-tag fallback that handles:
  * no JSON-LD at all -> build minimal payload from og:title,
    og:image, og:description, product:price:amount,
    product:price:currency, product:availability, product:brand
  * JSON-LD present but offers empty -> augment with an OG-derived
    offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback"
  so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag
  so blog posts don't get mis-classified as products.

Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
2026-04-22 17:07:31 +02:00
Valerio
7f5eb93b65 feat(extractors): wave 6b, etsy_listing + HTML fallbacks for substack/youtube
Adds etsy_listing and hardens two existing extractors with HTML fallbacks
so transient API failures still return useful data.

New:
- etsy_listing: /listing/{id}(/slug) with Schema.org Product JSON-LD +
  OG fallback. Antibot-gated, routes through cloud::smart_fetch_html
  like amazon_product and ebay_listing. Auto-dispatched (etsy host is
  unique).

Hardened:
- substack_post: when /api/v1/posts/{slug} returns non-200 (rate limit,
  403 on hardened custom domains, 5xx), fall back to HTML fetch and
  parse OG tags + Article JSON-LD. Response shape is stable across
  both paths, with a `data_source` field of "api" or "html_fallback".
- youtube_video: when ytInitialPlayerResponse is missing (EU-consent
  interstitial, age-gated, some live pre-shows), fall back to OG tags
  for title/description/thumbnail. `data_source` now "player_response"
  or "og_fallback".

Tests: 91 passing in webclaw-fetch (9 new), clippy clean.
2026-04-22 16:44:51 +02:00
Valerio
8cc727c2f2 feat(extractors): wave 6a, 5 easy verticals (27 total)
Adds 5 structured extractors that hit public APIs with stable shapes:

- github_issue: /repos/{o}/{r}/issues/{n} (rejects PRs, points to github_pr)
- shopify_collection: /collections/{handle}.json + products.json
- woocommerce_product: /wp-json/wc/store/v1/products?slug={slug}
- substack_post: /api/v1/posts/{slug} (works on custom domains too)
- youtube_video: ytInitialPlayerResponse blob from /watch HTML

Auto-dispatched: github_issue, youtube_video (unique hosts and stable
URL shapes). Explicit-call: shopify_collection, woocommerce_product,
substack_post (URL shapes overlap with non-target sites).

Tests: 82 total passing in webclaw-fetch (12 new), clippy clean.
2026-04-22 16:33:35 +02:00
Valerio
d8c9274a9c feat(extractors): wave 5 — Amazon, eBay, Trustpilot via cloud fallback
Three hard-site extractors that all require antibot bypass to ever
return usable data. They ship in OSS so the parsers + schema live
with the rest of the vertical extractors, but the fetch path routes
through cloud::smart_fetch_html — meaning:

- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or
  WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on
  challenge-page detection we escalate to api.webclaw.io/v1/scrape
  with formats=['html'] and parse the antibot-bypassed HTML locally.

- Without a cloud key, callers get a typed CloudError::NotConfigured
  whose Display message points at https://webclaw.io/signup.
  Self-hosters without a webclaw.io account know exactly what to do.

## New extractors (all auto-dispatched — unique hosts)

- amazon_product: ASIN extraction from /dp/, /gp/product/,
  /product/, /exec/obidos/ASIN/ URL shapes across every amazon.*
  locale. Parses the Product JSON-LD Amazon ships for SEO; falls
  back to #productTitle and #landingImage DOM selectors when
  JSON-LD is absent. Returns price, currency, availability,
  condition, brand, image, aggregate rating, SKU / MPN.

- ebay_listing: item-id extraction from /itm/{id} and
  /itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr /
  .it. Parses both bare Offer (Buy It Now) and AggregateOffer
  (used-copies / auctions) from the Product JSON-LD. Returns
  price or low/high-price range, currency, condition, seller,
  offer_count, aggregate rating.

- trustpilot_reviews: reactivated from the `trustpilot_reviews`
  file that was previously dead-code'd. Parser already worked; it
  just needed the smart_fetch_html path to get past AWS WAF's
  'Verifying your connection' interstitial. Organization / LocalBusiness
  JSON-LD block gives aggregate rating + up to 20 recent reviews.

## FetchClient change

- Added optional `cloud: Option<Arc<CloudClient>>` field with
  `FetchClient::with_cloud(cloud) -> Self` builder + `cloud(&self)`
  accessor. Extractors call client.cloud() to decide whether they
  can escalate. Cheap clones (Arc-wrapped).

## webclaw-server wiring

AppState::new() now reads the cloud credential from env:

1. WEBCLAW_CLOUD_API_KEY — preferred, disambiguates from the
   server's own inbound bearer token.
2. WEBCLAW_API_KEY — fallback only when the server is in open
   mode (no inbound-auth key set), matching the MCP / CLI
   convention of that env var.

When present, state.rs builds a CloudClient and attaches it to the
FetchClient via with_cloud(). Log line at startup so operators see
when cloud fallback is active.
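The precedence rule above, written as a pure function for clarity (the env-var names are from the commit; the function name and shape are assumptions):

```rust
// Sketch of the credential-precedence rule, taken out of env-var reading
// so it is a pure, testable function.
fn resolve_cloud_key(
    cloud_api_key: Option<&str>, // WEBCLAW_CLOUD_API_KEY
    api_key: Option<&str>,       // WEBCLAW_API_KEY
    inbound_auth_enabled: bool,  // server has its own inbound bearer token
) -> Option<String> {
    if let Some(k) = cloud_api_key.filter(|k| !k.is_empty()) {
        return Some(k.to_string()); // preferred, always wins
    }
    // WEBCLAW_API_KEY is only trusted when the server runs in open mode,
    // so it cannot be confused with the inbound bearer token.
    if !inbound_auth_enabled {
        if let Some(k) = api_key.filter(|k| !k.is_empty()) {
            return Some(k.to_string());
        }
    }
    None
}

fn main() {
    assert_eq!(resolve_cloud_key(Some("ck"), Some("ak"), true), Some("ck".into()));
    assert_eq!(resolve_cloud_key(None, Some("ak"), false), Some("ak".into()));
    assert_eq!(resolve_cloud_key(None, Some("ak"), true), None);
    println!("ok");
}
```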

## Catalog + dispatch

All three extractors registered in list() and in dispatch_by_url.
/v1/extractors catalog now exposes 22 verticals. Explicit
/v1/scrape/{vertical} routes work per the existing pattern.

## Tests

- 7 new unit tests (parse_asin multi-shape + parse from JSON-LD
  fixture + DOM-fallback on missing JSON-LD for Amazon; ebay
  URL-matching + slugged-URL parsing + both Offer and AggregateOffer
  fixtures).
- Full extractors suite: 68 passing (was 59, +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI — verifying them
  means having WEBCLAW_CLOUD_API_KEY set against a real cloud
  backend. Integration-test harness is a separate follow-up.

Catalog summary: 22 verticals total across waves 1-5. The three
hard-site extractors are gated behind an actionable cloud-fallback
upgrade path rather than silently returning nothing or 403-ing the
caller.
2026-04-22 16:16:11 +02:00
Valerio
0ab891bd6b refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch
The local-first / cloud-fallback flow was duplicated in two places:
- webclaw-mcp/src/cloud.rs (302 lines, canonical)
- webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid
  pulling rmcp as a dep)

Move to the shared crate where all vertical extractors and the new
webclaw-server can also reach it.

## New module: webclaw-fetch/src/cloud.rs

Single canonical home. Consolidates both previous versions and
promotes the error type from stringy to typed:

- `CloudError` enum with dedicated variants for the four HTTP
  outcomes callers act on differently — 401 (key rejected),
  402 (insufficient plan), 429 (rate limited), plus ServerError /
  Network / ParseFailed. Each variant's Display message ends with
  an actionable URL (signup / pricing / dashboard) so API consumers
  can surface it verbatim.

- `From<CloudError> for String` bridge so the dozen existing
  `.await?` call sites in MCP / CLI that expected `Result<_, String>`
  keep compiling. We can migrate them to the typed error per-site
  later without a churn commit.

- `CloudClient::new(Option<&str>)` matches the CLI's `--api-key`
  flag pattern (explicit key wins, env fallback, None when empty).
  `::from_env()` kept for MCP-style call sites.

- `with_key_and_base` for staging / integration tests.

- `scrape / post / get / fetch_html` — `fetch_html` is new, a
  convenience that calls /v1/scrape with formats=["html"] and
  returns the raw HTML string so vertical extractors can plug
  antibot-bypassed HTML straight into their parsers.

- `is_bot_protected` + `needs_js_rendering` detectors moved
  over verbatim. Detection patterns are public (CF / DataDome /
  AWS WAF challenge-page signatures) — no moat leak.

- `smart_fetch` kept on the original `Result<_, String>`
  signature so MCP's six call sites compile unchanged.

- `smart_fetch_html` is new: the local-first-then-cloud flow
  for the vertical-extractor pattern, returning the typed
  `CloudError` so extractors can emit precise upgrade-path
  messages.

## Cleanup

- Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to
  `webclaw_fetch::cloud::*`. Dropped reqwest as a direct dep of
  webclaw-mcp (it only used it for the old cloud client).
- Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its
  webhook / on-change / research HTTP calls.
- webclaw-fetch now has reqwest as a direct dep. It was already
  transitively pulled in by webclaw-llm; this just makes the
  dependency relationship explicit at the call site.

## Tests

16 new unit tests cover:
- CloudError status mapping (401/402/429/5xx)
- NotConfigured error includes signup URL
- CloudClient::new explicit-key-wins-over-env + empty-string = None
- base_url strips trailing slash
- Detector matrix (CF challenge / Turnstile / real content with
  embedded Turnstile / SPA skeleton / real article with script tags)
- truncate respects char boundaries (don't slice inside UTF-8)

Full workspace test suite still passes (~500 tests). fmt + clippy
clean. No behavior change for existing MCP / CLI call sites.
2026-04-22 16:05:44 +02:00
Valerio
0221c151dc feat(extractors): wave 4 — ecommerce (shopify + generic JSON-LD)
Two ecommerce extractors covering the long tail of online stores:

- shopify_product: hits the public /products/{handle}.json endpoint
  that every Shopify store exposes. Undocumented but stable for 10+
  years. Returns title, vendor, product_type, tags, full variants
  array (price, SKU, stock, options), images, options matrix, and
  the price_min/price_max/any_available summary fields. Covers the
  ~4M Shopify stores out there, modulo stores that put Cloudflare
  in front of the shop. Rejects known non-Shopify hosts (amazon,
  etsy, walmart, etc.) to save a failed request.

- ecommerce_product: generic Schema.org Product JSON-LD extractor.
  Works on any modern store that ships the Google-required Product
  rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace,
  Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN,
  images, normalized offers (Offer and AggregateOffer flattened into
  one shape with price, currency, availability, condition),
  aggregateRating, and the raw JSON-LD block for anyone who wants it.
  Reuses webclaw_core::structured_data::extract_json_ld so the
  JSON-LD parser stays shared across the extraction pipeline.

Both are explicit-call only — /v1/scrape/shopify_product and
/v1/scrape/ecommerce_product. Not in auto-dispatch because any
arbitrary /products/{slug} URL could belong to either platform
(or to a custom site that uses the same path shape), and claiming
such URLs blindly would steal from the default markdown /v1/scrape
flow.

Live test results against real stores:
- Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images,
  Size option, all SKUs. 250ms.
- ecommerce_product / same Allbirds URL: ProductGroup schema, name
  'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer.
  300ms. Different extraction path, same product.
- ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand,
  200ms.
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as
  expected — the error message points callers at the ecommerce_product
  fallback, but Cloudflare also blocks the HTML path so those stores
  are cloud-tier territory.

Catalog now exposes 19 extractors via GET /v1/extractors. Unit
tests: 59 passing across the module.

Scope not in v1:
- trustpilot_reviews: file written and tested (JSON-LD walker), but
  NOT registered in the catalog or dispatch. Trustpilot's Cloudflare
  turnstile blocks our Firefox + Chrome + Safari + mobile profiles
  at the TLS layer. Shipping it would return 403 more often than 200.
  Code kept in-tree under #[allow(dead_code)] for when the cloud
  tier has residential-proxy support.
- Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF
  story. Not fixable without real browser + proxy pool.
- WooCommerce explicit: most WooCommerce stores ship Product JSON-LD,
  so ecommerce_product covers them. A dedicated WooCommerce REST
  extractor (/wp-json/wc/store/products) would be marginal on top of
  that and only works on ~30% of stores that expose the REST API.

Wave 4 positioning: we now own the OSS structured-scrape space for
any site that respects Schema.org. That's Google's entire rich-result
index — meaningful territory competitors won't try to replicate as
named endpoints.
2026-04-22 15:36:01 +02:00
Valerio
3bb0a4bca0 feat(extractors): add LinkedIn + Instagram with profile-to-posts fan-out
3 social-network extractors that work entirely without auth, using
public embed/preview endpoints + Instagram's own SEO-facing API:

- linkedin_post:      /embed/feed/update/{urn} returns full body,
                      author, image, OG tags. Accepts both the urn:li:share
                      and urn:li:activity URN forms plus the pretty
                      /posts/{slug}-{id}-{suffix} URLs.

- instagram_post:     /p/{shortcode}/embed/captioned/ returns the full
                      caption, username, thumbnail. Same endpoint serves
                      reels and IGTV, kind correctly classified.

- instagram_profile:  /api/v1/users/web_profile_info/?username=X with the
                      x-ig-app-id header (Instagram's public web-app id,
                      sent by their own JS bundle). Returns the full
                      profile + the 12 most recent posts with shortcodes,
                      kinds, like/comment counts, thumbnails, and caption
                      previews. Falls back to OG-tag scraping of the
                      public HTML if the API ever 401/403s.

The IG profile output is shaped so callers can fan out cleanly:
  for p in profile.recent_posts:
      scrape('instagram_post', p.url)
giving you 'whole profile + every recent post' in one loop. End-to-end
tested against ticketswave: 1 profile call + 12 post calls in ~3.5s.
Pagination beyond 12 posts requires authenticated cookies and is left
for the cloud where we can stash a session.

Infrastructure change: added FetchClient::fetch_with_headers so
extractors can satisfy site-specific request headers (here x-ig-app-id;
later github_pr will use this for Authorization, etc.) without polluting
the global FetchConfig.headers map. Same retry semantics as fetch().

Catalog now exposes 17 extractors via /v1/extractors. Total unit tests
across the module: 47 passing. Clippy clean. Fmt clean.

Live test on the maintainer's example URLs:
- LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body
  / shipper.club link / CDN image extracted in 250ms.
- Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave
  username, thumbnail. 200ms.
- Instagram profile (ticketswave): 18,473 followers (exact, not
  rounded), is_verified=True, is_business=True, biography with emojis,
  12 recent posts with shortcodes + kinds + likes. 400ms.

Out of scope for this wave (require infra we don't have):
- linkedin_profile: returns 999 to all bot UAs, needs OAuth
- facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome
- facebook_profile (personal): not publicly accessible by design
2026-04-22 14:39:49 +02:00
Valerio
b041f3cddd feat(extractors): wave 2 — 8 more verticals (14 total)
Adds 8 more vertical extractors using public JSON APIs. All hit
deterministic endpoints with no antibot risk. Live tests pass
against canonical URLs for each.

AI / ML ecosystem (3):
- crates_io          → crates.io/api/v1/crates/{name}
- huggingface_dataset → huggingface.co/api/datasets/{path} (handles both
                       legacy /datasets/{name} and canonical {owner}/{name})
- arxiv              → export.arxiv.org/api/query (Atom XML parsed by quick-xml)

Code / version control (2):
- github_pr      → api.github.com/repos/{owner}/{repo}/pulls/{number}
- github_release → api.github.com/repos/{owner}/{repo}/releases/tags/{tag}

Infrastructure (1):
- docker_hub → hub.docker.com/v2/repositories/{namespace}/{name}
              (official-image shorthand /_/nginx normalized to library/nginx)

Community / publishing (2):
- dev_to        → dev.to/api/articles/{username}/{slug}
- stackoverflow → api.stackexchange.com/2.3/questions/{id} + answers,
                  filter=withbody for rendered HTML, sort=votes for
                  consistent top-answers ordering
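The official-image normalization noted in the docker_hub entry above can be sketched as follows (the helper name and exact return shape are illustrative):

```rust
// Normalize a Docker Hub path to (namespace, name). The /_/nginx
// shorthand for official images maps to the `library` namespace.
fn normalize_docker_repo(path: &str) -> Option<(String, String)> {
    let mut parts = path.trim_matches('/').splitn(2, '/');
    let first = parts.next()?;
    if first.is_empty() {
        return None;
    }
    match parts.next() {
        // /_/nginx is the official-image shorthand: library namespace.
        Some(name) if first == "_" => Some(("library".into(), name.into())),
        Some(name) => Some((first.into(), name.into())),
        // A bare name with no namespace also means an official image.
        None => Some(("library".into(), first.into())),
    }
}

fn main() {
    assert_eq!(
        normalize_docker_repo("/_/nginx"),
        Some(("library".into(), "nginx".into()))
    );
    assert_eq!(
        normalize_docker_repo("/grafana/grafana"),
        Some(("grafana".into(), "grafana".into()))
    );
    println!("ok");
}
```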

Live test results (real URLs):
- serde:                 942M downloads, 838B response
- 'Attention Is All You Need': abstract + authors, 1.8KB
- nginx official:        12.9B pulls, 21k stars, 17KB
- openai/gsm8k:          822k downloads, 1.7KB
- rust-lang/rust#138000: merged by RalfJung, +3/-2, 1KB
- webclaw v0.4.0:        2.4KB
- a real dev.to article: 2.2KB body, 3.1KB total
- python yield Q&A:      score 13133, 51 answers, 104KB

Catalog now exposes 14 extractors via GET /v1/extractors. Total
unit tests across the module: 34 passing. Clippy clean. Fmt clean.

Marketing positioning sharpens: 14 dedicated extractors, all
deterministic, all 1-credit-per-call. Firecrawl's /extract is
5 credits per call and you write the schema yourself.
2026-04-22 14:20:21 +02:00
Valerio
86182ef28a fix(server): switch default browser profile to Firefox
Reddit blocks wreq's Chrome 145 BoringSSL fingerprint at the JA3/JA4
TLS layer even though our HTTP headers correctly impersonate Chrome.
Curl from the same machine with the same Chrome User-Agent string
returns 200 from Reddit's .json endpoint; webclaw with the Chrome
profile returns 403. The detector clearly fingerprints below the
header layer.

Tested all six vertical extractors with the Firefox profile:
reddit, hackernews, github_repo, pypi, npm, huggingface_model all
return correct typed JSON. Firefox is a strict improvement on the
Chrome default for sites with active TLS-level bot detection, with
no regressions on the API-flavored sites that were already working.

Real fix is per-extractor preferred profile, but the structural
change to allow per-call profile selection in FetchClient is a
larger refactor. Flipping the global default is a one-line change
that ships the unblock now and lets users hit the new
/v1/scrape/{vertical} routes against Reddit immediately.
2026-04-22 14:11:55 +02:00
Valerio
8ba7538c37 feat(extractors): add vertical extractors module + first 6 verticals
New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:
- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog

First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):

- reddit       → www.reddit.com/*/.json
- hackernews   → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo  → api.github.com/repos/{owner}/{repo}
- pypi         → pypi.org/pypi/{name}/json
- npm          → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}

Server-side routes added:
- POST /v1/scrape/{vertical}  explicit per-vertical extraction
- GET  /v1/extractors         catalog (name, label, description, url_patterns)

The dispatcher validates that URL matches the requested vertical
before running, so users get "URL doesn't match the X extractor"
instead of opaque parse failures inside the extractor.
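
The validate-before-run behavior can be sketched roughly like this; the struct and `dispatch_by_name` signature here are illustrative stand-ins, not the real `extractors` module:

```rust
/// Resolve an extractor by name, then require its `matches()` to
/// accept the URL before running, so mismatches fail with a clear
/// message instead of an opaque parse error inside the extractor.
struct ExtractorInfo {
    name: &'static str,
    matches: fn(&str) -> bool,
}

fn dispatch_by_name<'a>(
    catalog: &'a [ExtractorInfo],
    name: &str,
    url: &str,
) -> Result<&'a ExtractorInfo, String> {
    let ex = catalog
        .iter()
        .find(|e| e.name == name)
        .ok_or_else(|| format!("unknown vertical '{name}'"))?;
    if !(ex.matches)(url) {
        return Err(format!("URL '{url}' does not match the '{name}' extractor"));
    }
    Ok(ex)
}

fn main() {
    let catalog = [ExtractorInfo {
        name: "reddit",
        matches: |u| u.contains("reddit.com"),
    }];
    assert!(dispatch_by_name(&catalog, "reddit", "https://www.reddit.com/r/rust/").is_ok());
    assert!(dispatch_by_name(&catalog, "reddit", "https://example.com/").is_err());
    assert!(dispatch_by_name(&catalog, "nope", "https://example.com/").is_err());
}
```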

17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github
863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).

Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic
extractors per site.
2026-04-22 14:11:43 +02:00
53 changed files with 10592 additions and 443 deletions


@ -3,6 +3,87 @@
All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).
## [0.5.6] — 2026-04-23
### Added
- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.
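
A rough sketch of the URL swap the Reddit rescue path performs before fetching; the helper name is hypothetical (the real logic lives in the crate's reddit module) and this version only strips the query string and appends `.json`:

```rust
/// Rewrite a Reddit comment-page URL so the fetch hits the public
/// `.json` API instead of the HTML verification interstitial.
fn reddit_json_url(url: &str) -> String {
    // Drop any query string or fragment: `/comments/abc/?sort=top` -> `/comments/abc/`
    let base = url.split(|c| c == '?' || c == '#').next().unwrap_or(url);
    let base = base.trim_end_matches('/');
    if base.ends_with(".json") {
        base.to_string()
    } else {
        format!("{base}.json")
    }
}

fn main() {
    assert_eq!(
        reddit_json_url("https://www.reddit.com/r/rust/comments/abc/?sort=top"),
        "https://www.reddit.com/r/rust/comments/abc.json"
    );
}
```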
### Fixed
- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart` and every caller path picks them up.
- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines. A `[1..len-1]` slice on a 1-char input tripped the `begin <= end` slice-bounds assertion and panicked. Now guarded with a length check.
---
## [0.5.5] — 2026-04-23
### Added
- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.
---
## [0.5.4] — 2026-04-23
### Added
- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Returns a locale-appropriate `Accept-Language` based on the URL's TLD, with `en-US` as the fallback.
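
A minimal sketch of the mapping these helpers perform, assuming an illustrative subset of locales (the crate's actual table is larger) and naive host parsing:

```rust
/// Map a TLD to an Accept-Language value, falling back to en-US.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7",
        "de" => "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7",
        "fr" => "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7",
        _ => "en-US,en;q=0.9",
    }
}

/// Pull the TLD out of a URL's host and delegate.
fn accept_language_for_url(url: &str) -> &'static str {
    let host = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url);
    let host = host.split('/').next().unwrap_or(host);
    let tld = host.rsplit('.').next().unwrap_or("");
    accept_language_for_tld(tld)
}

fn main() {
    assert_eq!(
        accept_language_for_url("https://www.immobiliare.it/annunci/1"),
        "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7"
    );
    assert_eq!(accept_language_for_url("https://example.com/"), "en-US,en;q=0.9");
}
```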
### Changed
- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
- Bumped `wreq-util` to `3.0.0-rc.10`.
---
## [0.5.2] — 2026-04-22
### Added
- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.
### Changed
- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded 10). Also `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.
---
## [0.5.1] — 2026-04-22
### Added
- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically.
The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses.
Backwards compatible. No behavior change for CLI, MCP, or OSS self-host.
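
The coercion story can be illustrated with a synchronous mini version of the pattern; the real `Fetcher` trait is async with three methods, so only the blanket-impl and `&dyn` mechanics are shown here:

```rust
use std::sync::Arc;

/// Toy stand-in for the async `Fetcher` trait.
trait Fetcher {
    fn fetch(&self, url: &str) -> String;
}

struct FetchClient;
impl Fetcher for FetchClient {
    fn fetch(&self, url: &str) -> String {
        format!("local:{url}")
    }
}

// Blanket impls so `&T` and `Arc<T>` also satisfy the trait bound,
// letting call sites pass whatever handle they already hold.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}
impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

/// Extractors take `&dyn Fetcher`, never the concrete client.
fn run_extractor(client: &dyn Fetcher, url: &str) -> String {
    client.fetch(url)
}

fn main() {
    let client = FetchClient;
    // `&FetchClient` coerces to `&dyn Fetcher` unchanged at call sites.
    assert_eq!(run_extractor(&client, "https://x"), "local:https://x");
    // `Arc<FetchClient>` works too, via the blanket impl.
    let shared = Arc::new(FetchClient);
    assert_eq!(run_extractor(&shared, "https://x"), "local:https://x");
}
```

A server can provide its own type implementing the trait (e.g. a sidecar-backed fetcher) and hand it to the same dispatcher unchanged.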
### Changed
- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces.
---
## [0.5.0] — 2026-04-22
### Added
- **28 vertical extractors that return typed JSON instead of generic markdown.** New `webclaw_fetch::extractors` module with one extractor per site. Dev: reddit, hackernews, github_repo / github_pr / github_issue / github_release, crates_io, pypi, npm. AI/ML: huggingface_model, huggingface_dataset, arxiv, docker_hub. Writing: dev_to, stackoverflow, youtube_video. Social: linkedin_post, instagram_post, instagram_profile. Ecommerce: shopify_product, shopify_collection, ecommerce_product (generic Schema.org), woocommerce_product, amazon_product, ebay_listing, etsy_listing. Reviews: trustpilot_reviews, substack_post. Each extractor claims a URL pattern via a public `matches()` fn and returns a typed JSON payload with the fields callers actually want (title, price, author, rating, review count, etc.) rather than a markdown blob.
- **`POST /v1/scrape/{vertical}` on `webclaw-server` for explicit vertical routing.** Picks the parser by name, validates the URL plausibly belongs to that vertical, returns the same shape as `POST /v1/scrape` but typed. 23 of 28 verticals also auto-dispatch from a plain `POST /v1/scrape` because their URL shapes are unique enough to claim safely; the remaining 5 (`shopify_product`, `shopify_collection`, `ecommerce_product`, `woocommerce_product`, `substack_post`) use patterns that non-target sites share, so callers opt in via the `{vertical}` route.
- **`GET /v1/extractors` on `webclaw-server`.** Returns the full catalog as `{"extractors": [{"name": "...", "label": "...", "description": "...", "url_patterns": [...]}, ...]}` so clients can build tooling / autocomplete / user-facing docs off a live source.
- **Antibot cloud-escalation for 5 ecommerce + reviews verticals.** Amazon, eBay, Etsy, Trustpilot, and Substack (as HTML fallback) go through `cloud::smart_fetch_html`: try local fetch first; on bot-protection detection (Cloudflare challenge, DataDome, AWS WAF "Verifying your connection", etc.) escalate to `api.webclaw.io/v1/scrape`. Without `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` the extractor returns a typed `CloudError::NotConfigured` with an actionable signup link. With a key set, escalation is automatic. Every extractor stamps a `data_source: "local" | "cloud"` field on the response so callers can tell which path ran.
- **`cloud::synthesize_html` for cloud-bypassed extraction.** `api.webclaw.io/v1/scrape` deliberately does not return raw HTML; it returns a parsed bundle (`structured_data` JSON-LD blocks + `metadata` OG/meta tags + `markdown`). The new helper reassembles that bundle back into a minimal synthetic HTML doc (JSON-LD as `<script>` tags, metadata as OG `<meta>` tags, markdown in a `<pre>`) so existing local parsers run unchanged across both paths. No per-extractor code path branches are needed for "came from cloud" vs "came from local".
- **Trustpilot 2025 schema parser.** Trustpilot replaced their single-Organization + aggregateRating shape with three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table `mainEntity` carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent reviews. The parser walks all three, skips the site-level Org, picks the Dataset by `about.@id` matching the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and returns recent reviews with author / country / date / rating / title / text / likes.
- **OG-tag fallback in `ecommerce_product` for sites with no JSON-LD and sites with JSON-LD but empty offers.** Three paths now: `jsonld` (Schema.org Product with offers), `jsonld+og` (Product JSON-LD plus OG product tags filling in missing price), and `og_fallback` (no JSON-LD at all, build minimal payload from `og:title`, `og:image`, `og:description`, `product:price:amount`, `product:price:currency`, `product:availability`, `product:brand`). `has_og_product_signal()` gates the fallback on `og:type=product` or a price tag so blog posts don't get mis-classified as products.
- **URL-slug title fallback in `etsy_listing` for delisted / blocked pages.** When Etsy serves a placeholder page (`"etsy.com"`, `"Etsy - Your place to buy..."`, `"This item is unavailable"`), humanise the URL slug (`/listing/123/personalized-stainless-steel-tumbler` becomes `"Personalized Stainless Steel Tumbler"`) so callers always get a meaningful title. Plus shop falls through `offers[].seller.name` then top-level `brand` because Etsy uses both schemas depending on listing age.
- **Force-cloud-escalation in `amazon_product` when local HTML lacks Product JSON-LD.** Amazon A/B-tests JSON-LD presence. When local fetch succeeds but has no `Product` block and a cloud client is configured, the extractor force-escalates to the cloud which reliably surfaces title + description via its render engine. Added OG meta-tag fallback so the cloud's synthesized HTML output (OG tags only, no Amazon DOM IDs) still yields title / image / description.
- **AWS WAF "Verifying your connection" detector in `cloud::is_bot_protected`.** Trustpilot serves a `~565` byte interstitial with an `interstitial-spinner` CSS class. The detector now fires on that pattern with a `< 10_000` byte size gate to avoid false positives on real articles that happen to mention the phrase.
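
The weighted-average rating the Trustpilot parser derives from the per-star distribution (see the 2025-schema bullet above) reduces to a short computation; `rating_from_distribution` is a hypothetical name for illustration:

```rust
/// Compute (average rating, total reviews) from per-star buckets of
/// the form `[(stars, count)]`, as parsed out of the csvw columns.
fn rating_from_distribution(buckets: &[(u32, u64)]) -> (f64, u64) {
    let total: u64 = buckets.iter().map(|(_, n)| n).sum();
    if total == 0 {
        return (0.0, 0);
    }
    let weighted: u64 = buckets.iter().map(|(s, n)| u64::from(*s) * n).sum();
    (weighted as f64 / total as f64, total)
}

fn main() {
    // 5*70 + 4*20 + 3*5 + 2*3 + 1*2 = 453 over 100 reviews -> 4.53
    let (avg, total) = rating_from_distribution(&[(5, 70), (4, 20), (3, 5), (2, 3), (1, 2)]);
    assert_eq!(total, 100);
    assert!((avg - 4.53).abs() < 1e-9);
}
```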
### Changed
- **`webclaw-fetch::FetchClient` gained an optional `cloud` field** via `with_cloud(CloudClient)`. Extractors reach it through `client.cloud()` to decide whether to escalate. `webclaw-server::AppState` reads `WEBCLAW_CLOUD_API_KEY` (preferred) or falls back to `WEBCLAW_API_KEY` only when inbound auth is not configured (open mode).
- **Consolidated `CloudClient` into `webclaw-fetch`.** Previously duplicated between `webclaw-mcp/src/cloud.rs` (302 LOC) and `webclaw-cli/src/cloud.rs` (80 LOC). Single canonical home with typed `CloudError` (`NotConfigured`, `Unauthorized`, `InsufficientPlan`, `RateLimited`, `ServerError`, `Network`, `ParseFailed`) that Display with actionable URLs; `From<CloudError> for String` bridge keeps pre-existing CLI / MCP call sites compiling unchanged during migration.
### Tests
- 215 unit tests passing in `webclaw-fetch` (100+ new, covering every extractor's matcher, URL parser, JSON-LD / OG fallback paths, and the cloud synthesis helper). `cargo clippy --workspace --release --no-deps` clean.
---
## [0.4.0] — 2026-04-22
### Added


@ -11,7 +11,7 @@ webclaw/
# + ExtractionOptions (include/exclude CSS selectors)
# + diff engine (change tracking)
# + brand extraction (DOM/CSS analysis)
webclaw-fetch/ # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
webclaw-fetch/ # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
# + proxy pool rotation (per-request)
# + PDF content-type detection
# + document parsing (DOCX, XLSX, CSV)
@ -40,7 +40,7 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
- `brand.rs` — Brand identity extraction from DOM structure and CSS
### Fetch Modules (`webclaw-fetch`)
- `client.rs` — FetchClient with primp TLS impersonation
- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations
- `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
- `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
- `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
@ -76,9 +76,10 @@ Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (R
## Hard Rules
- **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
- **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
## Build & Test

Cargo.lock generated

@ -2967,6 +2967,26 @@ dependencies = [
"pom",
]
[[package]]
name = "typed-builder"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
dependencies = [
"typed-builder-macro",
]
[[package]]
name = "typed-builder-macro"
version = "0.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "typed-path"
version = "0.12.3"
@ -3199,7 +3219,7 @@ dependencies = [
[[package]]
name = "webclaw-cli"
version = "0.4.0"
version = "0.5.6"
dependencies = [
"clap",
"dotenvy",
@ -3220,7 +3240,7 @@ dependencies = [
[[package]]
name = "webclaw-core"
version = "0.4.0"
version = "0.5.6"
dependencies = [
"ego-tree",
"once_cell",
@ -3238,13 +3258,16 @@ dependencies = [
[[package]]
name = "webclaw-fetch"
version = "0.4.0"
version = "0.5.6"
dependencies = [
"async-trait",
"bytes",
"calamine",
"http",
"quick-xml 0.37.5",
"rand 0.8.5",
"regex",
"reqwest",
"serde",
"serde_json",
"tempfile",
@ -3255,12 +3278,13 @@ dependencies = [
"webclaw-core",
"webclaw-pdf",
"wreq",
"wreq-util",
"zip 2.4.2",
]
[[package]]
name = "webclaw-llm"
version = "0.4.0"
version = "0.5.6"
dependencies = [
"async-trait",
"reqwest",
@ -3273,11 +3297,10 @@ dependencies = [
[[package]]
name = "webclaw-mcp"
version = "0.4.0"
version = "0.5.6"
dependencies = [
"dirs",
"dotenvy",
"reqwest",
"rmcp",
"schemars",
"serde",
@ -3294,7 +3317,7 @@ dependencies = [
[[package]]
name = "webclaw-pdf"
version = "0.4.0"
version = "0.5.6"
dependencies = [
"pdf-extract",
"thiserror",
@ -3303,7 +3326,7 @@ dependencies = [
[[package]]
name = "webclaw-server"
version = "0.4.0"
version = "0.5.6"
dependencies = [
"anyhow",
"axum",
@ -3707,6 +3730,16 @@ dependencies = [
"zstd",
]
[[package]]
name = "wreq-util"
version = "3.0.0-rc.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
dependencies = [
"typed-builder",
"wreq",
]
[[package]]
name = "writeable"
version = "0.6.2"


@ -3,7 +3,7 @@ resolver = "2"
members = ["crates/*"]
[workspace.package]
version = "0.4.0"
version = "0.5.6"
edition = "2024"
license = "AGPL-3.0"
repository = "https://github.com/0xMassi/webclaw"


@ -1,80 +0,0 @@
/// Cloud API client for automatic fallback when local extraction fails.
///
/// When WEBCLAW_API_KEY is set (or --api-key is passed), the CLI can fall back
/// to api.webclaw.io for bot-protected or JS-rendered sites. With --cloud flag,
/// all requests go through the cloud API directly.
///
/// NOTE: The canonical, full-featured cloud module lives in webclaw-mcp/src/cloud.rs
/// (smart_fetch, bot detection, JS rendering checks). This is the minimal subset
/// needed by the CLI, kept separate because depending on webclaw-mcp
/// would pull rmcp into the CLI's dependency graph.
use serde_json::{Value, json};
const API_BASE: &str = "https://api.webclaw.io/v1";
pub struct CloudClient {
api_key: String,
http: reqwest::Client,
}
impl CloudClient {
/// Create from explicit key or WEBCLAW_API_KEY env var.
pub fn new(explicit_key: Option<&str>) -> Option<Self> {
let key = explicit_key
.map(String::from)
.or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
.filter(|k| !k.is_empty())?;
Some(Self {
api_key: key,
http: reqwest::Client::new(),
})
}
/// Scrape via the cloud API.
pub async fn scrape(
&self,
url: &str,
formats: &[&str],
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
) -> Result<Value, String> {
let mut body = json!({
"url": url,
"formats": formats,
});
if only_main_content {
body["only_main_content"] = json!(true);
}
if !include_selectors.is_empty() {
body["include_selectors"] = json!(include_selectors);
}
if !exclude_selectors.is_empty() {
body["exclude_selectors"] = json!(exclude_selectors);
}
self.post("scrape", body).await
}
async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
let resp = self
.http
.post(format!("{API_BASE}/{endpoint}"))
.header("Authorization", format!("Bearer {}", self.api_key))
.json(&body)
.timeout(std::time::Duration::from_secs(120))
.send()
.await
.map_err(|e| format!("cloud API request failed: {e}"))?;
let status = resp.status();
if !status.is_success() {
let text = resp.text().await.unwrap_or_default();
return Err(format!("cloud API error {status}: {text}"));
}
resp.json::<Value>()
.await
.map_err(|e| format!("cloud API response parse failed: {e}"))
}
}


@ -1,7 +1,6 @@
/// CLI entry point -- wires webclaw-core and webclaw-fetch into a single command.
/// All extraction and fetching logic lives in sibling crates; this is pure plumbing.
mod bench;
mod cloud;
use std::io::{self, Read as _};
use std::path::{Path, PathBuf};
@ -309,6 +308,34 @@ enum Commands {
#[arg(long)]
facts: Option<PathBuf>,
},
/// List all vertical extractors in the catalog.
///
/// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
/// a human-friendly label, a one-line description, and the URL
/// patterns it claims. The same data is served by `/v1/extractors`
/// when running the REST API.
Extractors {
/// Emit JSON instead of a human-friendly table.
#[arg(long)]
json: bool,
},
/// Run a vertical extractor by name. Returns typed JSON with fields
/// specific to the target site (title, price, author, rating, etc.)
/// rather than generic markdown.
///
/// Use `webclaw extractors` to see the full list. Example:
/// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
Vertical {
/// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
name: String,
/// URL to extract.
url: String,
/// Emit compact JSON (single line). Default is pretty-printed.
#[arg(long)]
raw: bool,
},
}
#[derive(Clone, ValueEnum)]
@ -324,6 +351,9 @@ enum OutputFormat {
enum Browser {
Chrome,
Firefox,
/// Safari iOS 26. Pair with a country-matched residential proxy for sites
/// that reject non-mobile profiles.
SafariIos,
Random,
}
@ -350,6 +380,7 @@ impl From<Browser> for BrowserProfile {
match b {
Browser::Chrome => BrowserProfile::Chrome,
Browser::Firefox => BrowserProfile::Firefox,
Browser::SafariIos => BrowserProfile::SafariIos,
Browser::Random => BrowserProfile::Random,
}
}
@ -674,7 +705,7 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
let url = normalize_url(raw_url);
let url = url.as_str();
let cloud_client = cloud::CloudClient::new(cli.api_key.as_deref());
let cloud_client = webclaw_fetch::cloud::CloudClient::new(cli.api_key.as_deref());
// --cloud: skip local, go straight to cloud API
if cli.cloud {
@ -2289,6 +2320,83 @@ async fn main() {
}
return;
}
Commands::Extractors { json } => {
let entries = webclaw_fetch::extractors::list();
if *json {
// Serialize with serde_json. ExtractorInfo derives
// Serialize so this is a one-liner.
match serde_json::to_string_pretty(&entries) {
Ok(s) => println!("{s}"),
Err(e) => {
eprintln!("error: failed to serialise catalog: {e}");
process::exit(1);
}
}
} else {
// Human-friendly table: NAME + LABEL + one URL
// pattern sample. Keeps the output scannable on a
// narrow terminal.
println!("{} vertical extractors available:\n", entries.len());
let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
for e in &entries {
let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
println!(
" {:<nw$} {:<lw$} {}",
e.name,
e.label,
pattern_sample,
nw = name_w,
lw = label_w,
);
}
println!("\nRun one: webclaw vertical <name> <url>");
}
return;
}
Commands::Vertical { name, url, raw } => {
// Build a FetchClient with cloud fallback attached when
// WEBCLAW_API_KEY is set. Antibot-gated verticals
// (amazon, ebay, etsy, trustpilot) need this to escalate
// on bot protection.
let fetch_cfg = webclaw_fetch::FetchConfig {
browser: webclaw_fetch::BrowserProfile::Firefox,
..webclaw_fetch::FetchConfig::default()
};
let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
Ok(c) => c,
Err(e) => {
eprintln!("error: failed to build fetch client: {e}");
process::exit(1);
}
};
if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
client = client.with_cloud(cloud);
}
match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
Ok(data) => {
let rendered = if *raw {
serde_json::to_string(&data)
} else {
serde_json::to_string_pretty(&data)
};
match rendered {
Ok(s) => println!("{s}"),
Err(e) => {
eprintln!("error: JSON encode failed: {e}");
process::exit(1);
}
}
}
Err(e) => {
// UrlMismatch / UnknownVertical / Fetch all get
// Display impls with actionable messages.
eprintln!("error: {e}");
process::exit(1);
}
}
return;
}
}
}


@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
continue;
}
// Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
if trimmed.starts_with('|') && trimmed.ends_with('|') {
// Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
// Require at least 2 chars so the slice `[1..len-1]` stays non-empty on single-pipe rows
// (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
let inner = &trimmed[1..trimmed.len() - 1];
let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
lines.push(cells.join("\t"));


@ -12,12 +12,16 @@ serde = { workspace = true }
thiserror = { workspace = true }
tracing = { workspace = true }
tokio = { workspace = true }
async-trait = "0.1"
wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
wreq-util = "3.0.0-rc.10"
http = "1"
bytes = "1"
url = "2"
rand = "0.8"
quick-xml = { version = "0.37", features = ["serde"] }
regex = "1"
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
serde_json.workspace = true
calamine = "0.34"
zip = "2"


@ -7,6 +7,10 @@ pub enum BrowserProfile {
#[default]
Chrome,
Firefox,
/// Safari iOS 26 (iPhone). The one profile proven to defeat
/// DataDome's immobiliare.it / idealista.it / target.com-class
/// rules when paired with a country-scoped residential proxy.
SafariIos,
/// Randomly pick from all available profiles on each request.
Random,
}
@ -18,6 +22,7 @@ pub enum BrowserVariant {
ChromeMacos,
Firefox,
Safari,
SafariIos26,
Edge,
}


@ -177,6 +177,11 @@ enum ClientPool {
pub struct FetchClient {
pool: ClientPool,
pdf_mode: PdfMode,
/// Optional cloud-fallback client. Extractors that need to
/// escalate past bot protection call `client.cloud()` to get this
/// out. Stored as `Arc` so cloning a `FetchClient` (common in
/// axum state) doesn't clone the underlying reqwest pool.
cloud: Option<std::sync::Arc<crate::cloud::CloudClient>>,
}
impl FetchClient {
@ -225,13 +230,96 @@ impl FetchClient {
ClientPool::Rotating { clients }
};
Ok(Self { pool, pdf_mode })
Ok(Self {
pool,
pdf_mode,
cloud: None,
})
}
/// Attach a cloud-fallback client. Returns `self` so it composes in
/// a builder-ish way:
///
/// ```ignore
/// let client = FetchClient::new(config)?
/// .with_cloud(CloudClient::from_env()?);
/// ```
///
/// Extractors that can escalate past bot protection will call
/// `client.cloud()` internally. Attachment is unconditional and cheap
/// (it just wraps the client in an `Arc`), whether or not any
/// extractor ends up escalating.
pub fn with_cloud(mut self, cloud: crate::cloud::CloudClient) -> Self {
self.cloud = Some(std::sync::Arc::new(cloud));
self
}
/// Optional cloud-fallback client, if one was attached via
/// [`Self::with_cloud`]. Extractors that handle antibot sites
/// pass this into `cloud::smart_fetch_html`.
pub fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
self.cloud.as_deref()
}
/// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
/// `.json` API, and Akamai-style challenge responses trigger a homepage
/// cookie warmup and a retry. Returns the same `FetchResult` shape as
/// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
/// server) benefits without shape churn.
///
/// This is the method most callers want. Use plain [`Self::fetch`] only
/// when you need literal no-rescue behavior (e.g. inside the rescue
/// logic itself to avoid recursion).
pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
// Reddit: the HTML page shows a verification interstitial for most
// client IPs, but appending `.json` returns the post + comment tree
// publicly. `parse_reddit_json` in downstream code knows how to read
// the result; here we just do the URL swap at the fetch layer.
if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
let json_url = crate::reddit::json_url(url);
// Reddit's public .json API serves JSON to identifiable bot
// User-Agents and blocks browser UAs with a verification wall.
// Override our Chrome-profile UA for this specific call.
let ua = concat!(
"Webclaw/",
env!("CARGO_PKG_VERSION"),
" (+https://webclaw.io)"
);
if let Ok(resp) = self
.fetch_with_headers(&json_url, &[("user-agent", ua)])
.await
&& resp.status == 200
{
let first = resp.html.trim_start().as_bytes().first().copied();
if matches!(first, Some(b'{') | Some(b'[')) {
return Ok(resp);
}
}
// If the .json fetch failed or returned HTML, fall through.
}
let resp = self.fetch(url).await?;
// Akamai / bazadebezolkohpepadr challenge: visit the homepage to
// collect warmup cookies (_abck, bm_sz, etc.), then retry.
if is_challenge_html(&resp.html)
&& let Some(homepage) = extract_homepage(url)
{
debug!("challenge detected, warming cookies via {homepage}");
let _ = self.fetch(&homepage).await;
if let Ok(retry) = self.fetch(url).await {
return Ok(retry);
}
}
Ok(resp)
}
/// Fetch a URL and return the raw HTML + response metadata.
///
/// Automatically retries on transient failures (network errors, 5xx, 429)
/// with exponential backoff: 0s, 1s (2 attempts total).
/// with exponential backoff: 0s, 1s (2 attempts total). No per-site
/// rescue logic; use [`Self::fetch_smart`] for that.
#[instrument(skip(self), fields(url = %url))]
pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
let delays = [Duration::ZERO, Duration::from_secs(1)];
@ -279,14 +367,85 @@ impl FetchClient {
/// Single fetch attempt.
async fn fetch_once(&self, url: &str) -> Result<FetchResult, FetchError> {
self.fetch_once_with_headers(url, &[]).await
}
/// Single fetch attempt with optional per-request headers appended
/// after the profile defaults. Used by extractors that need to
/// satisfy site-specific headers (e.g. `x-ig-app-id` for Instagram's
/// internal API).
async fn fetch_once_with_headers(
&self,
url: &str,
extra: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
let start = Instant::now();
let client = self.pick_client(url);
let resp = client.get(url).send().await?;
let mut req = client.get(url);
for (k, v) in extra {
req = req.header(*k, *v);
}
let resp = req.send().await?;
let response = Response::from_wreq(resp).await?;
response_to_result(response, start)
}
/// Fetch a URL with extra per-request headers appended after the
/// browser-profile defaults. Same retry semantics as `fetch`.
///
/// Use this when an upstream API requires a header the global
/// `FetchConfig.headers` shouldn't carry to other hosts (Instagram's
/// `x-ig-app-id`, GitHub's `Authorization` once we wire `GITHUB_TOKEN`,
/// Reddit's compliant UA when we add OAuth, etc.).
#[instrument(skip(self, extra), fields(url = %url, extra_count = extra.len()))]
pub async fn fetch_with_headers(
&self,
url: &str,
extra: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
let delays = [Duration::ZERO, Duration::from_secs(1)];
let mut last_err = None;
for (attempt, delay) in delays.iter().enumerate() {
if attempt > 0 {
tokio::time::sleep(*delay).await;
}
match self.fetch_once_with_headers(url, extra).await {
Ok(result) => {
if is_retryable_status(result.status) && attempt < delays.len() - 1 {
warn!(
url,
status = result.status,
attempt = attempt + 1,
"retryable status, will retry"
);
last_err = Some(FetchError::Build(format!("HTTP {}", result.status)));
continue;
}
if attempt > 0 {
debug!(url, attempt = attempt + 1, "retry succeeded");
}
return Ok(result);
}
Err(e) => {
if !is_retryable_error(&e) || attempt == delays.len() - 1 {
return Err(e);
}
warn!(
url,
error = %e,
attempt = attempt + 1,
"transient error, will retry"
);
last_err = Some(e);
}
}
}
Err(last_err.unwrap_or_else(|| FetchError::Build("all retries exhausted".into())))
}
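`is_retryable_status` and the delay schedule live outside this hunk. A sketch consistent with the doc comment's "network errors, 5xx, 429" (an assumption about the real predicate, not a copy of it):

```rust
use std::time::Duration;

/// Assumed retry predicate: rate limiting and server-side failures
/// are treated as transient; other client errors are not.
fn is_retryable_status(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

/// The schedule used by the loop above: no delay before the first
/// attempt, 1s before the single retry.
fn retry_delays() -> [Duration; 2] {
    [Duration::ZERO, Duration::from_secs(1)]
}
```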
/// Fetch a URL then extract structured content.
#[instrument(skip(self), fields(url = %url))]
pub async fn fetch_and_extract(
@@ -495,12 +654,43 @@ impl FetchClient {
}
}
// ---------------------------------------------------------------------------
// Fetcher trait implementation
//
// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait
// rather than `FetchClient` directly, which is what lets the production
// API server swap in a tls-sidecar-backed implementation without
// pulling wreq into its dependency graph. For everyone else (CLI, MCP,
// self-hosted OSS server) this impl means "pass the FetchClient you
// already have; nothing changes".
// ---------------------------------------------------------------------------
#[async_trait::async_trait]
impl crate::fetcher::Fetcher for FetchClient {
async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
FetchClient::fetch(self, url).await
}
async fn fetch_with_headers(
&self,
url: &str,
headers: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
FetchClient::fetch_with_headers(self, url, headers).await
}
fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
FetchClient::cloud(self)
}
}
/// Collect the browser variants to use based on the browser profile.
fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
match profile {
BrowserProfile::Random => browser::all_variants(),
BrowserProfile::Chrome => vec![browser::latest_chrome()],
BrowserProfile::Firefox => vec![browser::latest_firefox()],
BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
}
}
@@ -578,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {
/// Detect if a response looks like a bot protection challenge page.
fn is_challenge_response(response: &Response) -> bool {
is_challenge_html(response.text().as_ref())
}
/// Same as `is_challenge_response`, operating on a body string directly
/// so callers holding a `FetchResult` can reuse the heuristic.
fn is_challenge_html(html: &str) -> bool {
let len = html.len();
if len > 15_000 || len == 0 {
return false;
}
let lower = html.to_lowercase();
if lower.contains("<title>challenge page</title>") {
return true;
}
if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
return true;
}
false
}


@@ -0,0 +1,853 @@
//! Cloud API fallback client for api.webclaw.io.
//!
//! When local fetch hits bot protection or a JS-only SPA, callers can
//! fall back to the hosted API which runs the full antibot / CDP
//! pipeline. This module is the shared home for that flow: previously
//! duplicated between `webclaw-mcp/src/cloud.rs` and
//! `webclaw-cli/src/cloud.rs`.
//!
//! ## Architecture
//!
//! - [`CloudClient`] — thin reqwest wrapper around the api.webclaw.io
//! REST surface. Typed errors for the four HTTP failures callers act
//! on differently (401 / 402 / 429 / other) plus network + parse.
//! - [`is_bot_protected`] / [`needs_js_rendering`] — pure detectors on
//! response bodies. The detection patterns are public (CF / DataDome
//! challenge-page signatures) so these live in OSS without leaking
//! any moat.
//! - [`smart_fetch`] — try-local-then-escalate flow returning an
//! [`ExtractionResult`] or raw cloud JSON. Kept on the original
//! `Result<_, String>` signature so the existing MCP / CLI call
//! sites work unchanged.
//! - [`smart_fetch_html`] — new convenience for the vertical-extractor
//! pattern: just give me antibot-bypassed HTML so I can run my own
//! parser on it. Returns the typed [`CloudError`] so extractors can
//! emit precise "upgrade your plan" / "invalid key" messages.
//!
//! ## Cloud response shape and [`synthesize_html`]
//!
//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
//! `html` field even when `formats=["html"]` is requested. By design
//! the cloud API returns a parsed bundle:
//!
//! ```text
//! {
//! "url": "https://...",
//! "metadata": { title, description, image, site_name, ... }, // OG / meta tags
//! "structured_data": [ { "@type": "...", ... }, ... ], // JSON-LD blocks
//! "markdown": "# Page Title\n\n...", // cleaned markdown
//! "antibot": { engine, path, user_agent }, // bypass telemetry
//! "cache": { status, age_seconds }
//! }
//! ```
//!
//! [`CloudClient::fetch_html`] reassembles that bundle back into a
//! minimal synthetic HTML document so the existing local extractor
//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
//! cloud output. Each `structured_data` entry becomes a
//! `<script type="application/ld+json">` tag; each `metadata` field
//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
//! `<pre>` inside the body. Callers that walk Schema.org blocks see
//! exactly what they'd see on a real live page.
//!
//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
//! won't hit on the synthesised HTML — those IDs only exist on live
//! Amazon pages. Extractors that need DOM regex keep OG meta tag
//! fallbacks for that reason.
//!
//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
//! signup when a site is blocked; nothing fails silently. Cloud users
//! get the escalation for free.
use std::time::Duration;
use http::HeaderMap;
use serde_json::{Value, json};
use thiserror::Error;
use tracing::{debug, info, warn};
// Client type isn't needed here anymore now that smart_fetch* takes
// `&dyn Fetcher`. Kept as a comment for historical context: this
// module used to import FetchClient directly before v0.5.1.
// ---------------------------------------------------------------------------
// URLs + defaults — keep in one place so "change the signup link" is a
// single-commit edit.
// ---------------------------------------------------------------------------
const API_BASE_DEFAULT: &str = "https://api.webclaw.io/v1";
const DEFAULT_TIMEOUT_SECS: u64 = 120;
const SIGNUP_URL: &str = "https://webclaw.io/signup";
const PRICING_URL: &str = "https://webclaw.io/pricing";
const KEYS_URL: &str = "https://webclaw.io/dashboard/api-keys";
// ---------------------------------------------------------------------------
// Errors
// ---------------------------------------------------------------------------
/// Structured cloud-fallback error. Variants correspond to the HTTP
/// outcomes callers act on differently — a 401 needs a different UX
/// than a 402 which needs a different UX than a network blip.
///
/// Display messages end with an actionable URL so API consumers can
/// surface them to users verbatim.
#[derive(Debug, Error)]
pub enum CloudError {
/// No `WEBCLAW_API_KEY` configured. Returned by [`smart_fetch_html`]
/// and friends when they hit bot protection but have no client to
/// escalate to.
#[error(
"this site is behind antibot protection. \
Set WEBCLAW_API_KEY to unlock automatic cloud bypass. \
Free tier: {SIGNUP_URL}"
)]
NotConfigured,
/// HTTP 401 — the key is present but rejected.
#[error(
"WEBCLAW_API_KEY rejected (HTTP 401). \
Check or regenerate your key at {KEYS_URL}"
)]
Unauthorized,
/// HTTP 402 — the key is valid but the plan doesn't cover the call.
#[error(
"your plan doesn't include this endpoint / site (HTTP 402). \
Upgrade at {PRICING_URL}"
)]
InsufficientPlan,
/// HTTP 429 — rate limit.
#[error(
"cloud API rate limit reached (HTTP 429). \
Wait a moment or upgrade at {PRICING_URL}"
)]
RateLimited,
/// HTTP 4xx / 5xx the caller probably can't do anything specific
/// about. Body is truncated to a sensible length for logs.
#[error("cloud API returned HTTP {status}: {body}")]
ServerError { status: u16, body: String },
#[error("cloud request failed: {0}")]
Network(String),
#[error("cloud response parse failed: {0}")]
ParseFailed(String),
}
impl CloudError {
/// Build from a non-success HTTP response, routing well-known
/// statuses to dedicated variants.
fn from_status_and_body(status: u16, body: String) -> Self {
match status {
401 => Self::Unauthorized,
402 => Self::InsufficientPlan,
429 => Self::RateLimited,
_ => Self::ServerError {
status,
body: truncate(&body, 500).to_string(),
},
}
}
}
impl From<reqwest::Error> for CloudError {
fn from(e: reqwest::Error) -> Self {
Self::Network(e.to_string())
}
}
/// Backwards-compatibility bridge: many pre-existing MCP / CLI call
/// sites `?`-propagate these errors inside functions returning
/// `Result<_, String>`. This `From` impl keeps those sites compiling
/// while we migrate them to the typed error over time.
impl From<CloudError> for String {
fn from(e: CloudError) -> Self {
e.to_string()
}
}
fn truncate(text: &str, max: usize) -> &str {
match text.char_indices().nth(max) {
Some((byte_pos, _)) => &text[..byte_pos],
None => text,
}
}
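Why `char_indices` rather than `&text[..max]`: byte slicing panics when `max` falls inside a multi-byte character. The same logic as `truncate` above, reproduced standalone so the boundary behaviour can be exercised directly:

```rust
/// Char-count truncation that never slices mid-character (same shape
/// as the crate's `truncate` helper above).
fn truncate(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        // nth(max) yields the byte offset of the (max+1)-th char,
        // which is always a valid slice boundary.
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}
```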
// ---------------------------------------------------------------------------
// CloudClient
// ---------------------------------------------------------------------------
/// Thin reqwest client around api.webclaw.io. Cheap to clone — the
/// inner `reqwest::Client` already refcounts its connection pool.
#[derive(Clone)]
pub struct CloudClient {
api_key: String,
base_url: String,
http: reqwest::Client,
}
impl CloudClient {
/// Build from an explicit key (e.g. a `--api-key` CLI flag) or fall
/// back to the `WEBCLAW_API_KEY` env var. Returns `None` when
/// neither is set / both are empty.
///
/// This is the function call sites should use by default — it's
/// what both the CLI and MCP want.
pub fn new(explicit_key: Option<&str>) -> Option<Self> {
explicit_key
.map(String::from)
.or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
.filter(|k| !k.trim().is_empty())
.map(Self::with_key)
}
/// Build from `WEBCLAW_API_KEY` env only. Thin wrapper kept for
/// readability at call sites that never accept a flag.
pub fn from_env() -> Option<Self> {
Self::new(None)
}
/// Build with an explicit key. Useful when the caller already has
/// a key from somewhere other than env or a flag (e.g. loaded from
/// config).
pub fn with_key(api_key: impl Into<String>) -> Self {
Self::with_key_and_base(api_key, API_BASE_DEFAULT)
}
/// Build with an explicit key and base URL. Used by integration
/// tests and staging deployments.
pub fn with_key_and_base(api_key: impl Into<String>, base_url: impl Into<String>) -> Self {
let http = reqwest::Client::builder()
.timeout(Duration::from_secs(DEFAULT_TIMEOUT_SECS))
.build()
.expect("reqwest client builder failed with default settings");
Self {
api_key: api_key.into(),
base_url: base_url.into().trim_end_matches('/').to_string(),
http,
}
}
pub fn base_url(&self) -> &str {
&self.base_url
}
/// Generic POST. Endpoint may be `"scrape"` or `"/scrape"` — we
/// normalise the slash.
pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, CloudError> {
let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
let resp = self
.http
.post(&url)
.header("Authorization", format!("Bearer {}", self.api_key))
.json(&body)
.send()
.await?;
parse_cloud_response(resp).await
}
/// Generic GET.
pub async fn get(&self, endpoint: &str) -> Result<Value, CloudError> {
let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
let resp = self
.http
.get(&url)
.header("Authorization", format!("Bearer {}", self.api_key))
.send()
.await?;
parse_cloud_response(resp).await
}
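The slash handling in `post` / `get` (plus the constructor's trailing-slash trim) distils to one pure join. A sketch, not the crate's API:

```rust
/// Distilled URL join: tolerate a trailing slash on the base and a
/// leading slash on the endpoint, and emit exactly one separator.
fn join_endpoint(base_url: &str, endpoint: &str) -> String {
    format!(
        "{}/{}",
        base_url.trim_end_matches('/'),
        endpoint.trim_start_matches('/')
    )
}
```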
/// `POST /v1/scrape` with the caller's extraction options. This is
/// the public "do everything" surface: the cloud side handles
/// fetch + antibot + JS render + extraction + formatting.
pub async fn scrape(
&self,
url: &str,
formats: &[&str],
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
) -> Result<Value, CloudError> {
let mut body = json!({ "url": url, "formats": formats });
if only_main_content {
body["only_main_content"] = json!(true);
}
if !include_selectors.is_empty() {
body["include_selectors"] = json!(include_selectors);
}
if !exclude_selectors.is_empty() {
body["exclude_selectors"] = json!(exclude_selectors);
}
self.post("scrape", body).await
}
/// Get antibot-bypassed page data back as a synthetic HTML string.
///
/// `api.webclaw.io/v1/scrape` intentionally does not return raw
/// HTML: it returns pre-parsed `structured_data` (JSON-LD blocks)
/// plus `metadata` (title, description, OG tags, image) plus a
/// `markdown` body. We reassemble those into a minimal HTML doc
/// that looks enough like the real page for our local extractor
/// parsers to run unchanged: each JSON-LD block gets emitted as a
/// `<script type="application/ld+json">` tag, metadata gets
/// emitted as OG `<meta>` tags, and the markdown lands in the
/// body. Extractors that walk JSON-LD (ecommerce_product,
/// trustpilot_reviews, ebay_listing, etsy_listing, amazon_product)
/// see exactly the same shapes they'd see from a live HTML fetch.
pub async fn fetch_html(&self, url: &str) -> Result<String, CloudError> {
let resp = self.scrape(url, &["markdown"], &[], &[], false).await?;
Ok(synthesize_html(&resp))
}
}
/// Reassemble a minimal HTML document from a cloud `/v1/scrape`
/// response so existing HTML-based extractor parsers can run against
/// cloud output without a separate code path.
fn synthesize_html(resp: &Value) -> String {
let mut out = String::with_capacity(8_192);
out.push_str("<html><head>\n");
// Metadata → OG meta tags. Keep keys stable with what local
// extractors read: og:title, og:description, og:image, og:site_name.
if let Some(meta) = resp.get("metadata").and_then(|m| m.as_object()) {
for (src_key, og_key) in [
("title", "title"),
("description", "description"),
("image", "image"),
("site_name", "site_name"),
] {
if let Some(val) = meta.get(src_key).and_then(|v| v.as_str())
&& !val.is_empty()
{
out.push_str(&format!(
"<meta property=\"og:{og_key}\" content=\"{}\">\n",
html_escape_attr(val)
));
}
}
}
// Structured data blocks → <script type="application/ld+json">.
// Serialise losslessly so extract_json_ld's parser gets the same
// shape it would get from a real page.
if let Some(blocks) = resp.get("structured_data").and_then(|v| v.as_array()) {
for block in blocks {
if let Ok(s) = serde_json::to_string(block) {
out.push_str("<script type=\"application/ld+json\">");
out.push_str(&s);
out.push_str("</script>\n");
}
}
}
out.push_str("</head><body>\n");
// Markdown body → plaintext in <body>. Extractors that regex over
// <div> IDs won't hit here, but they won't hit on local cloud
// bypass either. OK to keep minimal.
if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) {
out.push_str("<pre>");
out.push_str(&html_escape_text(md));
out.push_str("</pre>\n");
}
out.push_str("</body></html>");
out
}
fn html_escape_attr(s: &str) -> String {
s.replace('&', "&amp;")
.replace('"', "&quot;")
.replace('<', "&lt;")
.replace('>', "&gt;")
}
fn html_escape_text(s: &str) -> String {
s.replace('&', "&amp;")
.replace('<', "&lt;")
.replace('>', "&gt;")
}
async fn parse_cloud_response(resp: reqwest::Response) -> Result<Value, CloudError> {
let status = resp.status();
if status.is_success() {
return resp
.json()
.await
.map_err(|e| CloudError::ParseFailed(e.to_string()));
}
let body = resp.text().await.unwrap_or_default();
Err(CloudError::from_status_and_body(status.as_u16(), body))
}
// ---------------------------------------------------------------------------
// Detection
// ---------------------------------------------------------------------------
/// True when a fetched response body is actually a bot-protection
/// challenge page rather than the content the caller asked for.
///
/// Conservative — only fires on patterns that indicate the *entire*
/// page is a challenge, not embedded CAPTCHAs on a real content page.
pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
let html_lower = html.to_lowercase();
// Cloudflare challenge page.
if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
return true;
}
// Cloudflare "Just a moment" / "Checking your browser" interstitial.
if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
&& html_lower.contains("cf-spinner")
{
return true;
}
// Cloudflare Turnstile. Only counts when the page is small —
// legitimate pages embed Turnstile for signup forms etc.
if (html_lower.contains("cf-turnstile")
|| html_lower.contains("challenges.cloudflare.com/turnstile"))
&& html.len() < 100_000
{
return true;
}
// DataDome.
if html_lower.contains("geo.captcha-delivery.com")
|| html_lower.contains("captcha-delivery.com/captcha")
{
return true;
}
// AWS WAF.
if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
return true;
}
// AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
// Distinct from the captcha-branded path above: the challenge page is
// a tiny HTML shell with an `interstitial-spinner` div and no content.
// Gating on html.len() keeps false positives off long pages that
// happen to mention the phrase in an unrelated context.
if html_lower.contains("interstitial-spinner")
&& html_lower.contains("verifying your connection")
&& html.len() < 10_000
{
return true;
}
// hCaptcha *blocking* page (not just an embedded widget).
if html_lower.contains("hcaptcha.com")
&& html_lower.contains("h-captcha")
&& html.len() < 50_000
{
return true;
}
// Cloudflare via response headers + challenge body.
let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
if has_cf_headers
&& (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
{
return true;
}
false
}
/// True when a page likely needs JS rendering — a large HTML document
/// with almost no extractable text + an SPA framework signature.
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
let has_scripts = html.contains("<script");
// Tier 1: almost no extractable text from a large-ish page.
if word_count < 50 && html.len() > 5_000 && has_scripts {
return true;
}
// Tier 2: SPA framework markers + low content-to-HTML ratio.
if word_count < 800 && html.len() > 50_000 && has_scripts {
let html_lower = html.to_lowercase();
let has_spa_marker = html_lower.contains("react-app")
|| html_lower.contains("id=\"__next\"")
|| html_lower.contains("id=\"root\"")
|| html_lower.contains("id=\"app\"")
|| html_lower.contains("__next_data__")
|| html_lower.contains("nuxt")
|| html_lower.contains("ng-app");
if has_spa_marker {
return true;
}
}
false
}
// ---------------------------------------------------------------------------
// Smart-fetch: classic flow for MCP / CLI (returns either an extraction
// or raw cloud JSON)
// ---------------------------------------------------------------------------
/// Result of [`smart_fetch`]: either a local extraction or the raw
/// cloud API response when we escalated.
pub enum SmartFetchResult {
Local(Box<webclaw_core::ExtractionResult>),
Cloud(Value),
}
/// Try local fetch + extract first. On bot protection or detected
/// JS-render, fall back to `cloud.scrape(...)` with the caller's
/// formats. Returns `Err(String)` so existing call sites that expect
/// stringified errors keep compiling.
///
/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
/// [`CloudError`] so you can render precise UX.
pub async fn smart_fetch(
client: &dyn crate::fetcher::Fetcher,
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
formats: &[&str],
) -> Result<SmartFetchResult, String> {
let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
.await
.map_err(|_| format!("Fetch timed out after 30s for {url}"))?
.map_err(|e| format!("Fetch failed: {e}"))?;
if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
info!(url, "bot protection detected, falling back to cloud API");
return cloud_scrape_fallback(
cloud,
url,
include_selectors,
exclude_selectors,
only_main_content,
formats,
)
.await;
}
let options = webclaw_core::ExtractionOptions {
include_selectors: include_selectors.to_vec(),
exclude_selectors: exclude_selectors.to_vec(),
only_main_content,
include_raw_html: false,
};
let extraction =
webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
.map_err(|e| format!("Extraction failed: {e}"))?;
if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
info!(
url,
word_count = extraction.metadata.word_count,
html_len = fetch_result.html.len(),
"JS-rendered page detected, falling back to cloud API"
);
return cloud_scrape_fallback(
cloud,
url,
include_selectors,
exclude_selectors,
only_main_content,
formats,
)
.await;
}
Ok(SmartFetchResult::Local(Box::new(extraction)))
}
async fn cloud_scrape_fallback(
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
formats: &[&str],
) -> Result<SmartFetchResult, String> {
let Some(c) = cloud else {
return Err(CloudError::NotConfigured.to_string());
};
let resp = c
.scrape(
url,
formats,
include_selectors,
exclude_selectors,
only_main_content,
)
.await
.map_err(|e| e.to_string())?;
info!(url, "cloud API fallback successful");
Ok(SmartFetchResult::Cloud(resp))
}
// ---------------------------------------------------------------------------
// Smart-fetch-HTML: for vertical extractors
// ---------------------------------------------------------------------------
/// Where the HTML ultimately came from — useful for callers that want
/// to track "did we fall back?" for logging or pricing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchSource {
Local,
Cloud,
}
/// Antibot-aware HTML fetch result. The `html` field is always populated.
pub struct FetchedHtml {
pub html: String,
pub final_url: String,
pub source: FetchSource,
}
/// Try local fetch; on bot protection, escalate to the cloud's
/// `/v1/scrape` with `formats=["html"]` and return the raw HTML.
///
/// Designed for the vertical-extractor pattern where the caller has
/// its own parser and just needs bytes.
pub async fn smart_fetch_html(
client: &dyn crate::fetcher::Fetcher,
cloud: Option<&CloudClient>,
url: &str,
) -> Result<FetchedHtml, CloudError> {
let resp = client
.fetch(url)
.await
.map_err(|e| CloudError::Network(e.to_string()))?;
if !is_bot_protected(&resp.html, &resp.headers) {
return Ok(FetchedHtml {
html: resp.html,
final_url: resp.url,
source: FetchSource::Local,
});
}
let Some(c) = cloud else {
warn!(url, "bot protection detected + no cloud client configured");
return Err(CloudError::NotConfigured);
};
debug!(url, "bot protection detected, escalating to cloud");
let html = c.fetch_html(url).await?;
Ok(FetchedHtml {
html,
final_url: url.to_string(),
source: FetchSource::Cloud,
})
}
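The control flow of `smart_fetch_html` boils down to a three-way decision. A distilled, dependency-free sketch of that decision (the enum below is illustrative; the real function returns `FetchedHtml` / `CloudError`):

```rust
#[derive(Debug, PartialEq, Eq)]
enum Escalation {
    ServeLocal,
    EscalateToCloud,
    FailNotConfigured,
}

/// Distilled from smart_fetch_html: local HTML wins unless it is a
/// challenge page; a challenge page escalates only when a cloud
/// client is configured.
fn decide(bot_protected: bool, cloud_configured: bool) -> Escalation {
    match (bot_protected, cloud_configured) {
        (false, _) => Escalation::ServeLocal,
        (true, true) => Escalation::EscalateToCloud,
        (true, false) => Escalation::FailNotConfigured,
    }
}
```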
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
use super::*;
fn empty_headers() -> HeaderMap {
HeaderMap::new()
}
// --- detectors ----------------------------------------------------------
#[test]
fn is_bot_protected_detects_cloudflare_challenge() {
let html = "<html><body>_cf_chl_opt loaded</body></html>";
assert!(is_bot_protected(html, &empty_headers()));
}
#[test]
fn is_bot_protected_detects_turnstile_on_short_page() {
let html = "<div class=\"cf-turnstile\"></div>";
assert!(is_bot_protected(html, &empty_headers()));
}
#[test]
fn is_bot_protected_ignores_turnstile_on_real_content() {
let html = format!(
"<html><body>{}<div class=\"cf-turnstile\"></div></body></html>",
"lots of real content ".repeat(8_000)
);
assert!(!is_bot_protected(&html, &empty_headers()));
}
#[test]
fn is_bot_protected_detects_aws_waf_verifying_connection() {
// The exact shape Trustpilot serves under AWS WAF.
let html = r#"<div class="container"><div id="loading-state">
<div class="interstitial-spinner" id="spinner"></div>
<h1>Verifying your connection...</h1></div></div>"#;
assert!(is_bot_protected(html, &empty_headers()));
}
#[test]
fn synthesize_html_embeds_jsonld_and_og_tags() {
let resp = json!({
"url": "https://example.com/p/1",
"metadata": {
"title": "My Product",
"description": "A nice thing.",
"image": "https://cdn.example.com/1.jpg",
"site_name": "Example Shop"
},
"structured_data": [
{"@context":"https://schema.org","@type":"Product",
"name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
],
"markdown": "# Widget\n\nA nice widget."
});
let html = synthesize_html(&resp);
// OG tags from metadata.
assert!(html.contains(r#"<meta property="og:title" content="My Product">"#));
assert!(
html.contains(r#"<meta property="og:image" content="https://cdn.example.com/1.jpg">"#)
);
// JSON-LD block preserved losslessly.
assert!(html.contains(r#"<script type="application/ld+json">"#));
assert!(html.contains(r#""@type":"Product""#));
assert!(html.contains(r#""price":"9.99""#));
// Body carries markdown.
assert!(html.contains("A nice widget."));
}
#[test]
fn synthesize_html_handles_missing_fields_gracefully() {
let resp = json!({"url": "https://example.com", "metadata": {}});
let html = synthesize_html(&resp);
// No panic, no stray unclosed tags.
assert!(html.starts_with("<html><head>"));
assert!(html.ends_with("</body></html>"));
}
#[test]
fn synthesize_html_escapes_attribute_quotes() {
let resp = json!({
"metadata": {"title": r#"She said "hi""#}
});
let html = synthesize_html(&resp);
assert!(html.contains(r#"og:title" content="She said &quot;hi&quot;""#));
}
#[test]
fn is_bot_protected_ignores_phrase_on_real_content() {
// A real article that happens to mention the phrase in prose
// should not trigger the short-page detector.
let html = format!(
"<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
"article text ".repeat(2_000)
);
assert!(!is_bot_protected(&html, &empty_headers()));
}
#[test]
fn needs_js_rendering_flags_spa_skeleton() {
let html = format!(
"<html><body><div id=\"__next\"></div>{}</body></html>",
"<script>x</script>".repeat(500)
);
assert!(needs_js_rendering(10, &html));
}
#[test]
fn needs_js_rendering_passes_real_article() {
let html = format!(
"<html><body>{}<script>x</script></body></html>",
"Real article text ".repeat(5_000)
);
assert!(!needs_js_rendering(5_000, &html));
}
// --- CloudError mapping -------------------------------------------------
#[test]
fn cloud_error_maps_401() {
let e = CloudError::from_status_and_body(401, "invalid key".into());
assert!(matches!(e, CloudError::Unauthorized));
assert!(e.to_string().contains(KEYS_URL));
}
#[test]
fn cloud_error_maps_402() {
let e = CloudError::from_status_and_body(402, "{}".into());
assert!(matches!(e, CloudError::InsufficientPlan));
assert!(e.to_string().contains(PRICING_URL));
}
#[test]
fn cloud_error_maps_429() {
let e = CloudError::from_status_and_body(429, "slow down".into());
assert!(matches!(e, CloudError::RateLimited));
assert!(e.to_string().contains(PRICING_URL));
}
#[test]
fn cloud_error_maps_generic_5xx() {
let e = CloudError::from_status_and_body(503, "x".repeat(2000));
match e {
CloudError::ServerError { status, body } => {
assert_eq!(status, 503);
assert!(body.len() <= 500);
}
_ => panic!("expected ServerError"),
}
}
#[test]
fn not_configured_error_points_at_signup() {
let msg = CloudError::NotConfigured.to_string();
assert!(msg.contains(SIGNUP_URL));
assert!(msg.contains("WEBCLAW_API_KEY"));
}
// --- CloudClient construction ------------------------------------------
#[test]
fn cloud_client_explicit_key_wins_over_env() {
// SAFETY: this test mutates process env. Serial tests only.
// Set env to something, pass an explicit key, explicit should win.
// (We don't actually *call* the API, just check the struct stored
// the right key.)
// `std::env::set_var` is `unsafe` as of the Rust 2024 edition.
unsafe {
std::env::set_var("WEBCLAW_API_KEY", "from-env");
}
let client = CloudClient::new(Some("from-flag")).expect("client built");
assert_eq!(client.api_key, "from-flag");
unsafe {
std::env::remove_var("WEBCLAW_API_KEY");
}
}
#[test]
fn cloud_client_none_when_empty() {
unsafe {
std::env::remove_var("WEBCLAW_API_KEY");
}
assert!(CloudClient::new(None).is_none());
assert!(CloudClient::new(Some("")).is_none());
assert!(CloudClient::new(Some(" ")).is_none());
}
#[test]
fn cloud_client_base_url_strips_trailing_slash() {
let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/");
assert_eq!(c.base_url(), "https://api.example.com/v1");
}
#[test]
fn truncate_respects_char_boundaries() {
// Ensure we don't slice inside a multi-byte char: cutting just
// before the 2-byte `é` must land on a char boundary.
let s = "a".repeat(10) + "é"; // é is 2 bytes
assert_eq!(truncate(&s, 10), "a".repeat(10));
assert_eq!(truncate(&s, 11).chars().count(), 11); // max == char count: whole string back
}
}


@@ -0,0 +1,452 @@
//! Amazon product detail page extractor.
//!
//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
//! inconsistently protected. Sometimes our local TLS fingerprint gets
//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
//! sometimes we land on a real page that for whatever reason ships
//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
//! extractor has a two-stage fallback:
//!
//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
//! we have everything (title, brand, price, availability, rating).
//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
//! a cloud client is configured, force-escalate to api.webclaw.io.
//! Cloud's render + antibot pipeline reliably surfaces the
//! structured data. Without a cloud client we return whatever we
//! got from local (usually just title via `#productTitle` or OG
//! meta tags).
//!
//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
//! `#landingImage`) second, OG `<meta>` tags third. The OG path
//! matters because the cloud's synthesized HTML ships metadata as
//! OG tags but lacks Amazon's DOM IDs.
//!
//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
//! path. ASINs are a stable Amazon identifier so we extract that as
//! part of the response even when everything else is empty (tells
//! callers the URL was at least recognised).
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "amazon_product",
label: "Amazon product",
description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. Requires WEBCLAW_API_KEY — Amazon's antibot means we always go through the cloud.",
url_patterns: &[
"https://www.amazon.com/dp/{ASIN}",
"https://www.amazon.co.uk/dp/{ASIN}",
"https://www.amazon.de/dp/{ASIN}",
"https://www.amazon.fr/dp/{ASIN}",
"https://www.amazon.it/dp/{ASIN}",
"https://www.amazon.es/dp/{ASIN}",
"https://www.amazon.co.jp/dp/{ASIN}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !is_amazon_host(host) {
return false;
}
parse_asin(url).is_some()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let asin = parse_asin(url)
.ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;
let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
// Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
// pages (they A/B-test it). When local fetch succeeded but has no
// Product JSON-LD, force-escalate to the cloud which runs the
// render pipeline and reliably surfaces structured data. No-op
// when cloud isn't configured — we return whatever local gave us.
if fetched.source == cloud::FetchSource::Local
&& find_product_jsonld(&fetched.html).is_none()
&& let Some(c) = client.cloud()
{
match c.fetch_html(url).await {
Ok(cloud_html) => {
fetched = cloud::FetchedHtml {
html: cloud_html,
final_url: url.to_string(),
source: cloud::FetchSource::Cloud,
};
}
Err(e) => {
tracing::debug!(
error = %e,
"amazon_product: cloud escalation failed, keeping local"
);
}
}
}
let mut data = parse(&fetched.html, url, &asin);
if let Some(obj) = data.as_object_mut() {
obj.insert(
"data_source".into(),
match fetched.source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
}
Ok(data)
}
/// Pure parser. Given HTML (from anywhere — direct, cloud, or a fixture
/// file) and the source URL, extract Amazon product detail. Returns a
/// `Value` rather than a typed struct so callers can pass it through
/// without carrying webclaw_fetch types.
pub fn parse(html: &str, url: &str, asin: &str) -> Value {
let jsonld = find_product_jsonld(html);
// Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
// (only present on real static HTML) > cloud-synthesized og:title.
let title = jsonld
.as_ref()
.and_then(|v| get_text(v, "name"))
.or_else(|| dom_title(html))
.or_else(|| og(html, "title"));
let image = jsonld
.as_ref()
.and_then(get_first_image)
.or_else(|| dom_image(html))
.or_else(|| og(html, "image"));
let brand = jsonld.as_ref().and_then(get_brand);
let description = jsonld
.as_ref()
.and_then(|v| get_text(v, "description"))
.or_else(|| og(html, "description"));
let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
let offer = jsonld.as_ref().and_then(first_offer);
let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku"));
let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn"));
json!({
"url": url,
"asin": asin,
"title": title,
"brand": brand,
"description": description,
"image": image,
"price": offer.as_ref().and_then(|o| get_text(o, "price")),
"currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
"availability": offer.as_ref().and_then(|o| {
get_text(o, "availability").map(|s|
s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
}),
"condition": offer.as_ref().and_then(|o| {
get_text(o, "itemCondition").map(|s|
s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
}),
"sku": sku,
"mpn": mpn,
"aggregate_rating": aggregate_rating,
})
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn is_amazon_host(host: &str) -> bool {
host.starts_with("www.amazon.") || host.starts_with("amazon.")
}
/// Pull a 10-char ASIN out of any recognised Amazon URL shape:
/// - /dp/{ASIN}
/// - /gp/product/{ASIN}
/// - /product/{ASIN}
/// - /exec/obidos/ASIN/{ASIN}
fn parse_asin(url: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap()
});
re.captures(url)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
// ---------------------------------------------------------------------------
// JSON-LD walkers — light reuse of ecommerce_product's style
// ---------------------------------------------------------------------------
fn find_product_jsonld(html: &str) -> Option<Value> {
let blocks = webclaw_core::structured_data::extract_json_ld(html);
for b in blocks {
if let Some(found) = find_product_in(&b) {
return Some(found);
}
}
None
}
fn find_product_in(v: &Value) -> Option<Value> {
if is_product_type(v) {
return Some(v.clone());
}
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
None
}
fn is_product_type(v: &Value) -> bool {
let Some(t) = v.get("@type") else {
return false;
};
let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
match t {
Value::String(s) => is_prod(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn get_brand(v: &Value) -> Option<String> {
let brand = v.get("brand")?;
if let Some(s) = brand.as_str() {
return Some(s.to_string());
}
brand
.as_object()
.and_then(|o| o.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
}
fn get_first_image(v: &Value) -> Option<String> {
match v.get("image")? {
Value::String(s) => Some(s.clone()),
Value::Array(arr) => arr.iter().find_map(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}),
Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}
}
fn first_offer(v: &Value) -> Option<Value> {
let offers = v.get("offers")?;
match offers {
Value::Array(arr) => arr.first().cloned(),
Value::Object(_) => Some(offers.clone()),
_ => None,
}
}
fn get_aggregate_rating(v: &Value) -> Option<Value> {
let r = v.get("aggregateRating")?;
Some(json!({
"rating_value": get_text(r, "ratingValue"),
"review_count": get_text(r, "reviewCount"),
"best_rating": get_text(r, "bestRating"),
}))
}
// ---------------------------------------------------------------------------
// DOM fallbacks — cheap regex for the two fields most likely to be
// missing from JSON-LD on Amazon.
// ---------------------------------------------------------------------------
fn dom_title(html: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap());
re.captures(html)
.and_then(|c| c.get(1))
.map(|m| m.as_str().trim().to_string())
}
fn dom_image(html: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap());
re.captures(html)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
/// line of defence for `title`, `image`, `description`.
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| html_unescape(m.as_str()));
}
}
None
}
/// Undo the synthesize_html attribute escaping for the few entities it
/// emits. Keeps us off a heavier HTML-entity dep.
fn html_unescape(s: &str) -> String {
s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        // `&amp;` must be unescaped last: doing it earlier would turn a
        // literal `&amp;lt;` into `<` instead of `&lt;`.
        .replace("&amp;", "&")
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_multi_locale() {
assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY"));
assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/"));
assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1"));
assert!(matches(
"https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo"
));
}
#[test]
fn rejects_non_product_urls() {
assert!(!matches("https://www.amazon.com/"));
assert!(!matches("https://www.amazon.com/gp/cart"));
assert!(!matches("https://example.com/dp/B0CHX1W1XY"));
}
#[test]
fn parse_asin_extracts_from_multiple_shapes() {
assert_eq!(
parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"),
Some("B0CHX1W1XY".into())
);
assert_eq!(
parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"),
Some("B0CHX1W1XY".into())
);
assert_eq!(
parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
Some("B0CHX1W1XY".into())
);
assert_eq!(
parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"),
Some("B0CHX1W1XY".into())
);
assert_eq!(
parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"),
Some("B0CHX1W1XY".into())
);
assert_eq!(parse_asin("https://www.amazon.com/"), None);
}
#[test]
fn parse_extracts_from_fixture_jsonld() {
// Minimal Amazon-style fixture with a Product JSON-LD block.
let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"ACME Widget","sku":"B0CHX1W1XY",
"brand":{"@type":"Brand","name":"ACME"},
"image":"https://m.media-amazon.com/images/I/abc.jpg",
"offers":{"@type":"Offer","price":"19.99","priceCurrency":"USD",
"availability":"https://schema.org/InStock"},
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.6","reviewCount":"1234"}}
</script>
</head><body></body></html>"##;
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
assert_eq!(v["asin"], "B0CHX1W1XY");
assert_eq!(v["title"], "ACME Widget");
assert_eq!(v["brand"], "ACME");
assert_eq!(v["price"], "19.99");
assert_eq!(v["currency"], "USD");
assert_eq!(v["availability"], "InStock");
assert_eq!(v["aggregate_rating"]["rating_value"], "4.6");
assert_eq!(v["aggregate_rating"]["review_count"], "1234");
}
#[test]
fn parse_falls_back_to_dom_when_jsonld_missing_fields() {
let html = r#"
<html><body>
<span id="productTitle">Fallback Title</span>
<img id="landingImage" src="https://m.media-amazon.com/images/I/fallback.jpg" />
</body></html>
"#;
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
assert_eq!(v["title"], "Fallback Title");
assert_eq!(
v["image"],
"https://m.media-amazon.com/images/I/fallback.jpg"
);
}
#[test]
fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
// Shape we see from the cloud synthesize_html path: OG tags
// only, no JSON-LD, no Amazon DOM IDs.
let html = r##"<html><head>
<meta property="og:title" content="Cloud-sourced MacBook Pro">
<meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
<meta property="og:description" content="Via api.webclaw.io">
</head></html>"##;
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
assert_eq!(v["description"], "Via api.webclaw.io");
}
#[test]
fn og_unescape_handles_quot_entity() {
let html = r#"<meta property="og:title" content="Apple &quot;M2 Pro&quot; Laptop">"#;
assert_eq!(
og(html, "title").as_deref(),
Some(r#"Apple "M2 Pro" Laptop"#)
);
}
}
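The ASIN rule described in the module doc (a 10-character uppercase-alphanumeric segment after `/dp/`, `/gp/product/`, `/product/`, or `/ASIN/`) can be sketched without the `regex` dependency. `asin_from_path` below is a hypothetical stdlib-only equivalent of `parse_asin`, shown for illustration; it is not the extractor's actual code.

```rust
/// Stdlib-only sketch of the ASIN rule: find a known path marker, then
/// accept the following segment iff it is exactly 10 uppercase ASCII
/// alphanumerics. `parse_asin` expresses the same rule as one regex.
fn asin_from_path(url: &str) -> Option<String> {
    for marker in ["/dp/", "/gp/product/", "/product/", "/ASIN/"] {
        if let Some(rest) = url.split(marker).nth(1) {
            // The segment ends at '/', '?', '#', or end of string.
            let seg = rest.split(['/', '?', '#']).next().unwrap_or("");
            if seg.len() == 10
                && seg
                    .chars()
                    .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit())
            {
                return Some(seg.to_string());
            }
        }
    }
    None
}
```

The regex version is more compact; the sketch just makes the "10 uppercase alphanumerics, terminated by a path/query boundary" rule explicit.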


@@ -0,0 +1,314 @@
//! ArXiv paper structured extractor.
//!
//! Uses the public ArXiv API at `export.arxiv.org/api/query?id_list={id}`
//! which returns Atom XML. We parse just enough to surface title, authors,
//! abstract, categories, and the canonical PDF link. No HTML scraping
//! required and no auth.
use quick_xml::Reader;
use quick_xml::events::Event;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "arxiv",
label: "ArXiv paper",
description: "Returns paper metadata: title, authors, abstract, categories, primary category, PDF URL.",
url_patterns: &[
"https://arxiv.org/abs/{id}",
"https://arxiv.org/abs/{id}v{n}",
"https://arxiv.org/pdf/{id}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "arxiv.org" && host != "www.arxiv.org" {
return false;
}
url.contains("/abs/") || url.contains("/pdf/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let id = parse_id(url)
.ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?;
let api_url = format!("https://export.arxiv.org/api/query?id_list={id}");
let resp = client.fetch(&api_url).await?;
if resp.status != 200 {
return Err(FetchError::Build(format!(
"arxiv api returned status {}",
resp.status
)));
}
let entry = parse_atom_entry(&resp.html)
.ok_or_else(|| FetchError::BodyDecode("arxiv: no <entry> in response".into()))?;
if entry.title.is_none() && entry.summary.is_none() {
return Err(FetchError::BodyDecode(format!(
"arxiv: paper '{id}' returned empty entry (likely withdrawn or invalid id)"
)));
}
Ok(json!({
"url": url,
"id": id,
"arxiv_id": entry.id,
"title": entry.title,
"authors": entry.authors,
"abstract": entry.summary.map(|s| collapse_whitespace(&s)),
"published": entry.published,
"updated": entry.updated,
"primary_category": entry.primary_category,
"categories": entry.categories,
"doi": entry.doi,
"comment": entry.comment,
"pdf_url": entry.pdf_url,
"abs_url": entry.abs_url,
}))
}
// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Parse an arxiv id from a URL. Strips the version suffix (`v2`, `v3`)
/// and the `.pdf` extension when present.
fn parse_id(url: &str) -> Option<String> {
let after = url
.split("/abs/")
.nth(1)
.or_else(|| url.split("/pdf/").nth(1))?;
let stripped = after
.split(['?', '#'])
.next()?
.trim_end_matches('/')
.trim_end_matches(".pdf");
// Strip optional version suffix, e.g. "2401.12345v2" → "2401.12345"
let no_version = match stripped.rfind('v') {
Some(i) if stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) => &stripped[..i],
_ => stripped,
};
if no_version.is_empty() {
None
} else {
Some(no_version.to_string())
}
}
fn collapse_whitespace(s: &str) -> String {
s.split_whitespace().collect::<Vec<_>>().join(" ")
}
#[derive(Default)]
struct AtomEntry {
id: Option<String>,
title: Option<String>,
summary: Option<String>,
published: Option<String>,
updated: Option<String>,
primary_category: Option<String>,
categories: Vec<String>,
authors: Vec<String>,
doi: Option<String>,
comment: Option<String>,
pdf_url: Option<String>,
abs_url: Option<String>,
}
/// Parse the first `<entry>` block of an ArXiv Atom feed.
fn parse_atom_entry(xml: &str) -> Option<AtomEntry> {
let mut reader = Reader::from_str(xml);
let mut buf = Vec::new();
// States
let mut in_entry = false;
let mut current: Option<&'static str> = None;
let mut in_author = false;
let mut in_author_name = false;
let mut entry = AtomEntry::default();
loop {
match reader.read_event_into(&mut buf) {
Ok(Event::Start(ref e)) => {
let local = e.local_name();
match local.as_ref() {
b"entry" => in_entry = true,
b"id" if in_entry && !in_author => current = Some("id"),
b"title" if in_entry => current = Some("title"),
b"summary" if in_entry => current = Some("summary"),
b"published" if in_entry => current = Some("published"),
b"updated" if in_entry => current = Some("updated"),
b"author" if in_entry => in_author = true,
b"name" if in_author => {
in_author_name = true;
current = Some("author_name");
}
b"category" if in_entry => {
// primary_category is namespaced (arxiv:primary_category);
// category is plain. quick-xml gives us the local name only,
// so we treat both as categories and take the first one seen
// as primary.
for attr in e.attributes().flatten() {
if attr.key.as_ref() == b"term"
&& let Ok(v) = attr.unescape_value()
{
let term = v.to_string();
if entry.primary_category.is_none() {
entry.primary_category = Some(term.clone());
}
entry.categories.push(term);
}
}
}
b"link" if in_entry => {
let mut href = None;
let mut rel = None;
let mut typ = None;
for attr in e.attributes().flatten() {
match attr.key.as_ref() {
b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
_ => {}
}
}
if let Some(h) = href {
if typ.as_deref() == Some("application/pdf") {
entry.pdf_url = Some(h.clone());
}
if rel.as_deref() == Some("alternate") {
entry.abs_url = Some(h);
}
}
}
_ => current = None,
}
}
Ok(Event::Empty(ref e)) => {
// Self-closing tags (<link href="..." />). Same handling as Start.
let local = e.local_name();
if (local.as_ref() == b"link" || local.as_ref() == b"category") && in_entry {
let mut href = None;
let mut rel = None;
let mut typ = None;
let mut term = None;
for attr in e.attributes().flatten() {
match attr.key.as_ref() {
b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
b"term" => term = attr.unescape_value().ok().map(|s| s.to_string()),
_ => {}
}
}
if let Some(t) = term {
if entry.primary_category.is_none() {
entry.primary_category = Some(t.clone());
}
entry.categories.push(t);
}
if let Some(h) = href {
if typ.as_deref() == Some("application/pdf") {
entry.pdf_url = Some(h.clone());
}
if rel.as_deref() == Some("alternate") {
entry.abs_url = Some(h);
}
}
}
}
Ok(Event::Text(ref e)) => {
if let (Some(field), Ok(text)) = (current, e.unescape()) {
let text = text.to_string();
match field {
"id" => entry.id = Some(text.trim().to_string()),
"title" => entry.title = append_text(entry.title.take(), &text),
"summary" => entry.summary = append_text(entry.summary.take(), &text),
"published" => entry.published = Some(text.trim().to_string()),
"updated" => entry.updated = Some(text.trim().to_string()),
"author_name" => entry.authors.push(text.trim().to_string()),
_ => {}
}
}
}
Ok(Event::End(ref e)) => {
let local = e.local_name();
match local.as_ref() {
b"entry" => break,
b"author" => in_author = false,
b"name" => in_author_name = false,
_ => {}
}
if !in_author_name {
current = None;
}
}
Ok(Event::Eof) => break,
Err(_) => return None,
_ => {}
}
buf.clear();
}
if in_entry { Some(entry) } else { None }
}
/// Concatenate text fragments (long fields can be split across multiple
/// text events if they contain entities or CDATA).
fn append_text(prev: Option<String>, next: &str) -> Option<String> {
match prev {
Some(mut s) => {
s.push_str(next);
Some(s)
}
None => Some(next.to_string()),
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_arxiv_urls() {
assert!(matches("https://arxiv.org/abs/2401.12345"));
assert!(matches("https://arxiv.org/abs/2401.12345v2"));
assert!(matches("https://arxiv.org/pdf/2401.12345.pdf"));
assert!(!matches("https://arxiv.org/"));
assert!(!matches("https://example.com/abs/foo"));
}
#[test]
fn parse_id_strips_version_and_extension() {
assert_eq!(
parse_id("https://arxiv.org/abs/2401.12345"),
Some("2401.12345".into())
);
assert_eq!(
parse_id("https://arxiv.org/abs/2401.12345v3"),
Some("2401.12345".into())
);
assert_eq!(
parse_id("https://arxiv.org/pdf/2401.12345v2.pdf"),
Some("2401.12345".into())
);
}
#[test]
fn collapse_whitespace_handles_newlines_and_tabs() {
assert_eq!(collapse_whitespace("a b\n\tc "), "a b c");
}
}
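The version-stripping rule inside `parse_id` is the subtle part: `rfind('v')` must not eat a 'v' that belongs to an old-style id's category (e.g. `solv-int/9701001`). `strip_version` below is a hypothetical standalone sketch of just that rule, with an explicit non-empty-suffix guard added for clarity.

```rust
/// Sketch of the trailing-revision rule: `v{digits}` at the very end is
/// a version marker; any other 'v' (e.g. inside the "solv-int" category
/// of an old-style id) belongs to the id itself and must be kept.
fn strip_version(id: &str) -> &str {
    match id.rfind('v') {
        Some(i)
            if !id[i + 1..].is_empty()
                && id[i + 1..].chars().all(|c| c.is_ascii_digit()) =>
        {
            &id[..i]
        }
        _ => id,
    }
}
```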


@@ -0,0 +1,168 @@
//! crates.io structured extractor.
//!
//! Uses the public JSON API at `crates.io/api/v1/crates/{name}`. No
//! auth needed, and normal interactive usage stays under the rate
//! limits. The response includes both the crate metadata and the full
//! version list, which we summarize down to a count plus latest-release
//! info to keep the payload small.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "crates_io",
label: "crates.io package",
description: "Returns crate metadata: latest version, dependencies, downloads, license, repository.",
url_patterns: &[
"https://crates.io/crates/{name}",
"https://crates.io/crates/{name}/{version}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "crates.io" && host != "www.crates.io" {
return false;
}
url.contains("/crates/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let name = parse_name(url)
.ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?;
let api_url = format!("https://crates.io/api/v1/crates/{name}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"crates.io: crate '{name}' not found"
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"crates.io api returned status {}",
resp.status
)));
}
let body: CratesResponse = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("crates.io parse: {e}")))?;
let c = body.crate_;
let latest_version = body
.versions
.iter()
.find(|v| !v.yanked.unwrap_or(false))
.or_else(|| body.versions.first());
Ok(json!({
"url": url,
"name": c.id,
"description": c.description,
"homepage": c.homepage,
"documentation": c.documentation,
"repository": c.repository,
"max_stable_version": c.max_stable_version,
"max_version": c.max_version,
"newest_version": c.newest_version,
"downloads": c.downloads,
"recent_downloads": c.recent_downloads,
"categories": c.categories,
"keywords": c.keywords,
"release_count": body.versions.len(),
"latest_release_date": latest_version.and_then(|v| v.created_at.clone()),
"latest_license": latest_version.and_then(|v| v.license.clone()),
"latest_rust_version": latest_version.and_then(|v| v.rust_version.clone()),
"latest_yanked": latest_version.and_then(|v| v.yanked),
"created_at": c.created_at,
"updated_at": c.updated_at,
}))
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn parse_name(url: &str) -> Option<String> {
let after = url.split("/crates/").nth(1)?;
let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
let first = stripped.split('/').find(|s| !s.is_empty())?;
Some(first.to_string())
}
// ---------------------------------------------------------------------------
// crates.io API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct CratesResponse {
#[serde(rename = "crate")]
crate_: CrateInfo,
#[serde(default)]
versions: Vec<VersionInfo>,
}
#[derive(Deserialize)]
struct CrateInfo {
id: Option<String>,
description: Option<String>,
homepage: Option<String>,
documentation: Option<String>,
repository: Option<String>,
max_stable_version: Option<String>,
max_version: Option<String>,
newest_version: Option<String>,
downloads: Option<i64>,
recent_downloads: Option<i64>,
#[serde(default)]
categories: Vec<String>,
#[serde(default)]
keywords: Vec<String>,
created_at: Option<String>,
updated_at: Option<String>,
}
#[derive(Deserialize)]
struct VersionInfo {
license: Option<String>,
rust_version: Option<String>,
yanked: Option<bool>,
created_at: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_crate_pages() {
assert!(matches("https://crates.io/crates/serde"));
assert!(matches("https://crates.io/crates/tokio/1.45.0"));
assert!(!matches("https://crates.io/"));
assert!(!matches("https://example.com/crates/foo"));
}
#[test]
fn parse_name_handles_versioned_urls() {
assert_eq!(
parse_name("https://crates.io/crates/serde"),
Some("serde".into())
);
assert_eq!(
parse_name("https://crates.io/crates/tokio/1.45.0"),
Some("tokio".into())
);
assert_eq!(
parse_name("https://crates.io/crates/scraper/?foo=bar"),
Some("scraper".into())
);
}
}
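The "first non-yanked version, else newest" selection in `extract` can be shown on plain tuples. `pick_latest` below is a hypothetical sketch; like the code above, it assumes the crates.io API returns the versions array newest-first.

```rust
/// Sketch of the latest-release pick: skip yanked versions, but fall
/// back to the newest entry when every release has been yanked, so the
/// caller still gets *some* date/license info.
fn pick_latest<'a>(versions: &'a [(&'a str, bool)]) -> Option<&'a str> {
    versions
        .iter()
        .find(|v| !v.1) // v.1 == yanked
        .or_else(|| versions.first())
        .map(|v| v.0)
}
```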


@@ -0,0 +1,188 @@
//! dev.to article structured extractor.
//!
//! `dev.to/api/articles/{username}/{slug}` returns the full article body,
//! tags, reaction count, comment count, and reading time. Anonymous
//! access works fine for published posts.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "dev_to",
label: "dev.to article",
description: "Returns article metadata + body: title, body markdown, tags, reactions, comments, reading time.",
url_patterns: &["https://dev.to/{username}/{slug}"],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "dev.to" && host != "www.dev.to" {
return false;
}
let path = url
.split("://")
.nth(1)
.and_then(|s| s.split_once('/'))
.map(|(_, p)| p)
.unwrap_or("");
let stripped = path
.split(['?', '#'])
.next()
.unwrap_or("")
.trim_end_matches('/');
let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
// Need exactly /{username}/{slug}, and the first segment must not be
// a reserved site path.
segs.len() == 2 && !RESERVED_FIRST_SEGS.contains(&segs[0])
}
const RESERVED_FIRST_SEGS: &[&str] = &[
"api",
"tags",
"search",
"settings",
"enter",
"signup",
"about",
"code-of-conduct",
"privacy",
"terms",
"contact",
"sponsorships",
"sponsors",
"shop",
"videos",
"listings",
"podcasts",
"p",
"t",
];
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (username, slug) = parse_username_slug(url).ok_or_else(|| {
FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'"))
})?;
let api_url = format!("https://dev.to/api/articles/{username}/{slug}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"dev_to: article '{username}/{slug}' not found"
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"dev.to api returned status {}",
resp.status
)));
}
let a: Article = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("dev.to parse: {e}")))?;
Ok(json!({
"url": url,
"id": a.id,
"title": a.title,
"description": a.description,
"body_markdown": a.body_markdown,
"url_canonical": a.canonical_url,
"published_at": a.published_at,
"edited_at": a.edited_at,
"reading_time_min": a.reading_time_minutes,
"tags": a.tag_list,
"positive_reactions": a.positive_reactions_count,
"public_reactions": a.public_reactions_count,
"comments_count": a.comments_count,
"page_views_count": a.page_views_count,
"cover_image": a.cover_image,
"author": json!({
"username": a.user.as_ref().and_then(|u| u.username.clone()),
"name": a.user.as_ref().and_then(|u| u.name.clone()),
"twitter": a.user.as_ref().and_then(|u| u.twitter_username.clone()),
"github": a.user.as_ref().and_then(|u| u.github_username.clone()),
"website": a.user.as_ref().and_then(|u| u.website_url.clone()),
}),
}))
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn parse_username_slug(url: &str) -> Option<(String, String)> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
let username = segs.next()?;
let slug = segs.next()?;
Some((username.to_string(), slug.to_string()))
}
// ---------------------------------------------------------------------------
// dev.to API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Article {
id: Option<i64>,
title: Option<String>,
description: Option<String>,
body_markdown: Option<String>,
canonical_url: Option<String>,
published_at: Option<String>,
edited_at: Option<String>,
reading_time_minutes: Option<i64>,
tag_list: Option<serde_json::Value>, // string OR array depending on endpoint
positive_reactions_count: Option<i64>,
public_reactions_count: Option<i64>,
comments_count: Option<i64>,
page_views_count: Option<i64>,
cover_image: Option<String>,
user: Option<UserRef>,
}
#[derive(Deserialize)]
struct UserRef {
username: Option<String>,
name: Option<String>,
twitter_username: Option<String>,
github_username: Option<String>,
website_url: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_article_urls() {
assert!(matches("https://dev.to/ben/welcome-thread"));
assert!(matches("https://dev.to/0xmassi/some-post-1abc"));
assert!(!matches("https://dev.to/"));
assert!(!matches("https://dev.to/api/articles/foo/bar"));
assert!(!matches("https://dev.to/tags/rust"));
assert!(!matches("https://dev.to/ben")); // user profile, not article
assert!(!matches("https://example.com/ben/post"));
}
#[test]
fn parse_pulls_username_and_slug() {
assert_eq!(
parse_username_slug("https://dev.to/ben/welcome-thread"),
Some(("ben".into(), "welcome-thread".into()))
);
assert_eq!(
parse_username_slug("https://dev.to/0xmassi/some-post-1abc/?foo=bar"),
Some(("0xmassi".into(), "some-post-1abc".into()))
);
}
}
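The two-segment article gate in `matches` is easy to get wrong: profiles have one segment, API and tag paths have a reserved first segment. `looks_like_article` below sketches the gate standalone, with a shortened stand-in for `RESERVED_FIRST_SEGS`.

```rust
/// Sketch of the dev.to article gate: a path is an article iff it has
/// exactly two non-empty segments and the first is not a reserved site
/// section. The real reserved list is longer; this one is illustrative.
fn looks_like_article(path: &str) -> bool {
    const RESERVED: &[&str] = &["api", "tags", "search", "t", "p"];
    let clean = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_matches('/');
    let segs: Vec<&str> = clean.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 2 && !RESERVED.contains(&segs[0])
}
```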


@@ -0,0 +1,150 @@
//! Docker Hub repository structured extractor.
//!
//! Uses the v2 JSON API at `hub.docker.com/v2/repositories/{namespace}/{name}`.
//! Anonymous access is allowed for public images. The official-image
//! shorthand (e.g. `nginx`, `redis`) is normalized to `library/{name}`.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "docker_hub",
label: "Docker Hub repository",
description: "Returns image metadata: pull count, star count, last_updated, official flag, description.",
url_patterns: &[
"https://hub.docker.com/_/{name}",
"https://hub.docker.com/r/{namespace}/{name}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "hub.docker.com" {
return false;
}
url.contains("/_/") || url.contains("/r/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (namespace, name) = parse_repo(url)
.ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?;
let api_url = format!("https://hub.docker.com/v2/repositories/{namespace}/{name}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"docker_hub: repo '{namespace}/{name}' not found"
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"docker_hub api returned status {}",
resp.status
)));
}
let r: RepoResponse = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("docker_hub parse: {e}")))?;
Ok(json!({
"url": url,
"namespace": r.namespace,
"name": r.name,
"full_name": format!("{namespace}/{name}"),
"pull_count": r.pull_count,
"star_count": r.star_count,
"description": r.description,
"full_description": r.full_description,
"last_updated": r.last_updated,
"date_registered": r.date_registered,
"is_official": namespace == "library",
"is_private": r.is_private,
"status_description": r.status_description,
"categories": r.categories,
}))
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Parse `(namespace, name)` from a Docker Hub URL. The official-image
/// shorthand `/_/nginx` maps to `(library, nginx)`. Personal repos
/// `/r/foo/bar` map to `(foo, bar)`.
fn parse_repo(url: &str) -> Option<(String, String)> {
if let Some(after) = url.split("/_/").nth(1) {
let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
let name = stripped.split('/').next().filter(|s| !s.is_empty())?;
return Some(("library".into(), name.to_string()));
}
let after = url.split("/r/").nth(1)?;
let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
let ns = segs.next()?;
let nm = segs.next()?;
Some((ns.to_string(), nm.to_string()))
}
#[derive(Deserialize)]
struct RepoResponse {
namespace: Option<String>,
name: Option<String>,
pull_count: Option<i64>,
star_count: Option<i64>,
description: Option<String>,
full_description: Option<String>,
last_updated: Option<String>,
date_registered: Option<String>,
is_private: Option<bool>,
status_description: Option<String>,
#[serde(default)]
categories: Vec<DockerCategory>,
}
#[derive(Deserialize, serde::Serialize)]
struct DockerCategory {
name: Option<String>,
slug: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_docker_urls() {
assert!(matches("https://hub.docker.com/_/nginx"));
assert!(matches("https://hub.docker.com/r/grafana/grafana"));
assert!(!matches("https://hub.docker.com/"));
assert!(!matches("https://example.com/_/nginx"));
}
#[test]
fn parse_repo_handles_official_and_personal() {
assert_eq!(
parse_repo("https://hub.docker.com/_/nginx"),
Some(("library".into(), "nginx".into()))
);
assert_eq!(
parse_repo("https://hub.docker.com/_/nginx/tags"),
Some(("library".into(), "nginx".into()))
);
assert_eq!(
parse_repo("https://hub.docker.com/r/grafana/grafana"),
Some(("grafana".into(), "grafana".into()))
);
assert_eq!(
parse_repo("https://hub.docker.com/r/grafana/grafana/?foo=bar"),
Some(("grafana".into(), "grafana".into()))
);
}
}
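The `host_of` splitter in this file has no direct test coverage. A minimal stand-alone sketch (hypothetical `host_of_demo` name, duplicating the same split logic) shows it tolerates both full URLs and scheme-less input:

```rust
// Hypothetical stand-alone copy of the host_of splitter above:
// drop the scheme, then keep everything before the first '/'.
fn host_of_demo(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn main() {
    assert_eq!(host_of_demo("https://hub.docker.com/_/nginx"), "hub.docker.com");
    // No scheme: nth(1) is None, so the whole input is split on '/'.
    assert_eq!(host_of_demo("hub.docker.com/_/nginx"), "hub.docker.com");
    println!("ok");
}
```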


@@ -0,0 +1,337 @@
//! eBay listing extractor.
//!
//! eBay item pages at `ebay.com/itm/{id}` and international variants
//! usually ship a `Product` JSON-LD block with title, price, currency,
//! condition, and an `AggregateOffer` when bidding. eBay applies
//! Cloudflare + custom WAF selectively: some item IDs return normal
//! HTML to the Firefox profile, while others get a 403 or the "Pardon
//! our interruption" page. We route through `cloud::smart_fetch_html`
//! so both paths resolve to the same parser.
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "ebay_listing",
label: "eBay listing",
description: "Returns item title, price, currency, condition, seller, shipping, and bid info. Heavy listings may need WEBCLAW_API_KEY for antibot.",
url_patterns: &[
"https://www.ebay.com/itm/{id}",
"https://www.ebay.co.uk/itm/{id}",
"https://www.ebay.de/itm/{id}",
"https://www.ebay.fr/itm/{id}",
"https://www.ebay.it/itm/{id}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !is_ebay_host(host) {
return false;
}
parse_item_id(url).is_some()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let item_id = parse_item_id(url)
.ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
let mut data = parse(&fetched.html, url, &item_id);
if let Some(obj) = data.as_object_mut() {
obj.insert(
"data_source".into(),
match fetched.source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
}
Ok(data)
}
pub fn parse(html: &str, url: &str, item_id: &str) -> Value {
let jsonld = find_product_jsonld(html);
let title = jsonld
.as_ref()
.and_then(|v| get_text(v, "name"))
.or_else(|| og(html, "title"));
let image = jsonld
.as_ref()
.and_then(get_first_image)
.or_else(|| og(html, "image"));
let brand = jsonld.as_ref().and_then(get_brand);
let description = jsonld
.as_ref()
.and_then(|v| get_text(v, "description"))
.or_else(|| og(html, "description"));
let offer = jsonld.as_ref().and_then(first_offer);
// eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price.
let (low_price, high_price, single_price) = match offer.as_ref() {
Some(o) => (
get_text(o, "lowPrice"),
get_text(o, "highPrice"),
get_text(o, "price"),
),
None => (None, None, None),
};
let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount"));
let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
json!({
"url": url,
"item_id": item_id,
"title": title,
"brand": brand,
"description": description,
"image": image,
"price": single_price,
"low_price": low_price,
"high_price": high_price,
"offer_count": offer_count,
"currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
"availability": offer.as_ref().and_then(|o| {
get_text(o, "availability").map(|s|
s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
}),
"condition": offer.as_ref().and_then(|o| {
get_text(o, "itemCondition").map(|s|
s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
}),
"seller": offer.as_ref().and_then(|o|
o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)),
"aggregate_rating": aggregate_rating,
})
}
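The availability/condition fields above are normalised by stripping the schema.org prefix inline. As a stand-alone sketch (hypothetical `strip_schema` name, mirroring those `.replace()` calls):

```rust
// Hypothetical helper mirroring the inline .replace() calls in parse():
// turn a schema.org enum URL into its bare variant name.
fn strip_schema(s: &str) -> String {
    s.replace("http://schema.org/", "")
        .replace("https://schema.org/", "")
}

fn main() {
    assert_eq!(strip_schema("https://schema.org/InStock"), "InStock");
    assert_eq!(strip_schema("http://schema.org/UsedCondition"), "UsedCondition");
    println!("ok");
}
```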
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn is_ebay_host(host: &str) -> bool {
host.starts_with("www.ebay.") || host.starts_with("ebay.")
}
/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}`
/// URLs. IDs are 10-15 digits today, but we accept any trailing
/// segment of eight or more digits (the regex's `\d{8,}`) so the
/// extractor stays forward-compatible.
fn parse_item_id(url: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
// /itm/(optional-slug/)?(digits)([/?#]|end)
Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap()
});
re.captures(url)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------
fn find_product_jsonld(html: &str) -> Option<Value> {
let blocks = webclaw_core::structured_data::extract_json_ld(html);
for b in blocks {
if let Some(found) = find_product_in(&b) {
return Some(found);
}
}
None
}
fn find_product_in(v: &Value) -> Option<Value> {
if is_product_type(v) {
return Some(v.clone());
}
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
None
}
fn is_product_type(v: &Value) -> bool {
let Some(t) = v.get("@type") else {
return false;
};
let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
match t {
Value::String(s) => is_prod(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn get_brand(v: &Value) -> Option<String> {
let brand = v.get("brand")?;
if let Some(s) = brand.as_str() {
return Some(s.to_string());
}
brand
.as_object()
.and_then(|o| o.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
}
fn get_first_image(v: &Value) -> Option<String> {
match v.get("image")? {
Value::String(s) => Some(s.clone()),
Value::Array(arr) => arr.iter().find_map(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}),
Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}
}
fn first_offer(v: &Value) -> Option<Value> {
let offers = v.get("offers")?;
match offers {
Value::Array(arr) => arr.first().cloned(),
Value::Object(_) => Some(offers.clone()),
_ => None,
}
}
fn get_aggregate_rating(v: &Value) -> Option<Value> {
let r = v.get("aggregateRating")?;
Some(json!({
"rating_value": get_text(r, "ratingValue"),
"review_count": get_text(r, "reviewCount"),
"best_rating": get_text(r, "bestRating"),
}))
}
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_ebay_item_urls() {
assert!(matches("https://www.ebay.com/itm/325478156234"));
assert!(matches(
"https://www.ebay.com/itm/vintage-typewriter/325478156234"
));
assert!(matches("https://www.ebay.co.uk/itm/325478156234"));
assert!(!matches("https://www.ebay.com/"));
assert!(!matches("https://www.ebay.com/sch/foo"));
assert!(!matches("https://example.com/itm/325478156234"));
}
#[test]
fn parse_item_id_handles_slugged_urls() {
assert_eq!(
parse_item_id("https://www.ebay.com/itm/325478156234"),
Some("325478156234".into())
);
assert_eq!(
parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"),
Some("325478156234".into())
);
assert_eq!(
parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"),
Some("325478156234".into())
);
}
#[test]
fn parse_extracts_from_fixture_jsonld() {
let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Vintage Typewriter","sku":"TW-001",
"brand":{"@type":"Brand","name":"Olivetti"},
"image":"https://i.ebayimg.com/images/abc.jpg",
"offers":{"@type":"Offer","price":"79.99","priceCurrency":"GBP",
"availability":"https://schema.org/InStock",
"itemCondition":"https://schema.org/UsedCondition",
"seller":{"@type":"Person","name":"vintage_seller_99"}}}
</script>
</head></html>"##;
let v = parse(html, "https://www.ebay.co.uk/itm/325", "325");
assert_eq!(v["title"], "Vintage Typewriter");
assert_eq!(v["price"], "79.99");
assert_eq!(v["currency"], "GBP");
assert_eq!(v["availability"], "InStock");
assert_eq!(v["condition"], "UsedCondition");
assert_eq!(v["seller"], "vintage_seller_99");
assert_eq!(v["brand"], "Olivetti");
}
#[test]
fn parse_handles_aggregate_offer_price_range() {
let html = r##"
<script type="application/ld+json">
{"@type":"Product","name":"Used Copies",
"offers":{"@type":"AggregateOffer","offerCount":"5",
"lowPrice":"10.00","highPrice":"50.00","priceCurrency":"USD"}}
</script>
"##;
let v = parse(html, "https://www.ebay.com/itm/1", "1");
assert_eq!(v["low_price"], "10.00");
assert_eq!(v["high_price"], "50.00");
assert_eq!(v["offer_count"], "5");
assert_eq!(v["currency"], "USD");
}
}


@@ -0,0 +1,553 @@
//! Generic ecommerce product extractor via Schema.org JSON-LD.
//!
//! Nearly every modern ecommerce site ships a `<script type="application/ld+json">`
//! Product block for SEO / rich-result snippets, since Google's
//! rich-result guidelines effectively require the markup for shopping
//! search.
//! We take advantage of it: one extractor that works on Shopify,
//! BigCommerce, WooCommerce, Squarespace, Magento, custom storefronts,
//! and anything else that follows Schema.org.
//!
//! **Explicit-call only** (`/v1/scrape/ecommerce_product`). Not in the
//! auto-dispatch because we can't identify "this is a product page"
//! from the URL alone. When the caller knows they have a product URL,
//! this is the reliable fallback for stores where shopify_product
//! doesn't apply.
//!
//! The extractor reuses `webclaw_core::structured_data::extract_json_ld`
//! so JSON-LD parsing is shared with the rest of the extraction
//! pipeline. We walk all blocks looking for `@type: Product`,
//! `ProductGroup`, or an `ItemList` whose first entry is a Product.
//!
//! ## OG fallback
//!
//! Two real-world cases JSON-LD alone can't cover:
//!
//! 1. Site has no Product JSON-LD at all (smaller Squarespace / custom
//! storefronts, many European shops).
//! 2. Site has Product JSON-LD but the `offers` block is empty (seen on
//! Patagonia and other catalog-style sites that split price onto a
//! separate widget).
//!
//! For case 1 we build a minimal payload from OG / product meta tags
//! (`og:title`, `og:image`, `og:description`, `product:price:amount`,
//! `product:price:currency`, `product:availability`, `product:brand`).
//! For case 2 we augment the JSON-LD offers list with an OG-derived
//! offer so callers get a price either way. A `data_source` field
//! (`"jsonld"` / `"jsonld+og"` / `"og_fallback"`) tells the caller
//! which branch produced the data.
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "ecommerce_product",
label: "Ecommerce product (generic)",
description: "Returns product info from any site that ships Schema.org Product JSON-LD: name, description, images, brand, SKU, price, availability, aggregate rating.",
url_patterns: &[
"https://{any-ecom-store}/products/{slug}",
"https://{any-ecom-store}/product/{slug}",
"https://{any-ecom-store}/p/{slug}",
],
};
pub fn matches(url: &str) -> bool {
// Maximally permissive: explicit-call-only extractor. We trust the
// caller knows they're pointing at a product page. Custom ecom
// sites use every conceivable URL shape (warbyparker.com uses
// `/eyeglasses/{category}/{slug}/{colour}`, etc.), so path-pattern
// matching would false-negative a lot. All we gate on is a valid
// http(s) URL with a host.
if !(url.starts_with("http://") || url.starts_with("https://")) {
return false;
}
!host_of(url).is_empty()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let resp = client.fetch(url).await?;
if !(200..300).contains(&resp.status) {
return Err(FetchError::Build(format!(
"ecommerce_product: status {} for {url}",
resp.status
)));
}
parse(&resp.html, url).ok_or_else(|| {
FetchError::BodyDecode(format!(
"ecommerce_product: no Schema.org Product JSON-LD and no OG product tags on {url}"
))
})
}
/// Pure parser: try JSON-LD first, fall back to OG meta tags. Returns
/// `None` when neither path has enough to say "this is a product page".
pub fn parse(html: &str, url: &str) -> Option<Value> {
// Reuse the core JSON-LD parser so we benefit from whatever
// robustness it gains over time (handling @graph, arrays, etc.).
let blocks = webclaw_core::structured_data::extract_json_ld(html);
let product = find_product(&blocks);
if let Some(p) = product {
Some(build_jsonld_payload(&p, html, url))
} else if has_og_product_signal(html) {
Some(build_og_payload(html, url))
} else {
None
}
}
/// Build the rich payload from a Product JSON-LD node. Augments the
/// `offers` array with an OG-derived offer when JSON-LD offers is empty
/// so callers get a price on sites like Patagonia.
fn build_jsonld_payload(product: &Value, html: &str, url: &str) -> Value {
let mut offers = collect_offers(product);
let mut data_source = "jsonld";
if offers.is_empty()
&& let Some(og_offer) = build_og_offer(html)
{
offers.push(og_offer);
data_source = "jsonld+og";
}
json!({
"url": url,
"data_source": data_source,
"name": get_text(product, "name").or_else(|| og(html, "title")),
"description": get_text(product, "description").or_else(|| og(html, "description")),
"brand": get_brand(product).or_else(|| meta_property(html, "product:brand")),
"sku": get_text(product, "sku"),
"mpn": get_text(product, "mpn"),
"gtin": get_text(product, "gtin")
.or_else(|| get_text(product, "gtin13"))
.or_else(|| get_text(product, "gtin12"))
.or_else(|| get_text(product, "gtin8")),
"product_id": get_text(product, "productID"),
"category": get_text(product, "category"),
"color": get_text(product, "color"),
"material": get_text(product, "material"),
"images": nonempty_or_og(collect_images(product), html),
"offers": offers,
"aggregate_rating": get_aggregate_rating(product),
"review_count": get_review_count(product),
"raw_schema_type": get_text(product, "@type"),
"raw_jsonld": product.clone(),
})
}
/// Build a minimal payload from OG / product meta tags. Used when a
/// page has no Product JSON-LD at all.
fn build_og_payload(html: &str, url: &str) -> Value {
let offers = build_og_offer(html).map(|o| vec![o]).unwrap_or_default();
let image = og(html, "image");
let images: Vec<Value> = image.map(|i| vec![Value::String(i)]).unwrap_or_default();
json!({
"url": url,
"data_source": "og_fallback",
"name": og(html, "title"),
"description": og(html, "description"),
"brand": meta_property(html, "product:brand"),
"sku": None::<String>,
"mpn": None::<String>,
"gtin": None::<String>,
"product_id": None::<String>,
"category": None::<String>,
"color": None::<String>,
"material": None::<String>,
"images": images,
"offers": offers,
"aggregate_rating": Value::Null,
"review_count": None::<String>,
"raw_schema_type": None::<String>,
"raw_jsonld": Value::Null,
})
}
fn nonempty_or_og(imgs: Vec<Value>, html: &str) -> Vec<Value> {
if !imgs.is_empty() {
return imgs;
}
og(html, "image")
.map(|s| vec![Value::String(s)])
.unwrap_or_default()
}
// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------
/// Recursively walk the JSON-LD blocks and return the first node whose
/// `@type` is Product, ProductGroup, or IndividualProduct.
fn find_product(blocks: &[Value]) -> Option<Value> {
for b in blocks {
if let Some(found) = find_product_in(b) {
return Some(found);
}
}
None
}
fn find_product_in(v: &Value) -> Option<Value> {
if is_product_type(v) {
return Some(v.clone());
}
// @graph: [ {...}, {...} ]
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
// Bare array wrapper
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
None
}
fn is_product_type(v: &Value) -> bool {
let Some(t) = v.get("@type") else {
return false;
};
let match_str = |s: &str| {
matches!(
s,
"Product" | "ProductGroup" | "IndividualProduct" | "Vehicle" | "SomeProducts"
)
};
match t {
Value::String(s) => match_str(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn get_brand(v: &Value) -> Option<String> {
let brand = v.get("brand")?;
if let Some(s) = brand.as_str() {
return Some(s.to_string());
}
if let Some(obj) = brand.as_object()
&& let Some(n) = obj.get("name").and_then(|x| x.as_str())
{
return Some(n.to_string());
}
None
}
fn collect_images(v: &Value) -> Vec<Value> {
match v.get("image") {
Some(Value::String(s)) => vec![Value::String(s.clone())],
Some(Value::Array(arr)) => arr
.iter()
.filter_map(|x| match x {
Value::String(s) => Some(Value::String(s.clone())),
Value::Object(_) => x.get("url").cloned(),
_ => None,
})
.collect(),
Some(Value::Object(o)) => o.get("url").cloned().into_iter().collect(),
_ => Vec::new(),
}
}
/// Normalise both bare Offer and AggregateOffer into a uniform array.
fn collect_offers(v: &Value) -> Vec<Value> {
let Some(offers) = v.get("offers") else {
return Vec::new();
};
let collect_single = |o: &Value| -> Option<Value> {
Some(json!({
"price": get_text(o, "price"),
"low_price": get_text(o, "lowPrice"),
"high_price": get_text(o, "highPrice"),
"currency": get_text(o, "priceCurrency"),
"availability": get_text(o, "availability").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
"item_condition": get_text(o, "itemCondition").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
"valid_until": get_text(o, "priceValidUntil"),
"url": get_text(o, "url"),
"seller": o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from),
"offer_count": get_text(o, "offerCount"),
}))
};
match offers {
Value::Array(arr) => arr.iter().filter_map(collect_single).collect(),
Value::Object(_) => collect_single(offers).into_iter().collect(),
_ => Vec::new(),
}
}
fn get_aggregate_rating(v: &Value) -> Option<Value> {
let r = v.get("aggregateRating")?;
Some(json!({
"rating_value": get_text(r, "ratingValue"),
"best_rating": get_text(r, "bestRating"),
"worst_rating": get_text(r, "worstRating"),
"rating_count": get_text(r, "ratingCount"),
"review_count": get_text(r, "reviewCount"),
}))
}
fn get_review_count(v: &Value) -> Option<String> {
v.get("aggregateRating")
.and_then(|r| get_text(r, "reviewCount"))
.or_else(|| get_text(v, "reviewCount"))
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
// ---------------------------------------------------------------------------
// OG / product meta-tag helpers
// ---------------------------------------------------------------------------
/// True when the HTML has enough OG / product meta tags to justify
/// building a fallback payload. A single `og:title` isn't enough on its
/// own — every blog post has that. We require either a product price
/// tag or at least an `og:type` of `product`/`og:product` to avoid
/// mis-classifying articles as products.
fn has_og_product_signal(html: &str) -> bool {
let has_price = meta_property(html, "product:price:amount").is_some()
|| meta_property(html, "og:price:amount").is_some();
if has_price {
return true;
}
// `<meta property="og:type" content="product">` is the Schema.org OG
// marker for product pages.
let og_type = og(html, "type").unwrap_or_default().to_lowercase();
matches!(og_type.as_str(), "product" | "og:product" | "product.item")
}
/// Build a single Offer-shaped Value from OG / product meta tags, or
/// `None` if there's no price info at all.
fn build_og_offer(html: &str) -> Option<Value> {
let price = meta_property(html, "product:price:amount")
.or_else(|| meta_property(html, "og:price:amount"));
let currency = meta_property(html, "product:price:currency")
.or_else(|| meta_property(html, "og:price:currency"));
let availability = meta_property(html, "product:availability")
.or_else(|| meta_property(html, "og:availability"));
price.as_ref()?;
Some(json!({
"price": price,
"low_price": None::<String>,
"high_price": None::<String>,
"currency": currency,
"availability": availability,
"item_condition": None::<String>,
"valid_until": None::<String>,
"url": None::<String>,
"seller": None::<String>,
"offer_count": None::<String>,
}))
}
/// Pull the value of `<meta property="og:{prop}" content="...">`.
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
/// Pull the value of any `<meta property="..." content="...">` tag.
/// Needed for namespaced OG variants like `product:price:amount` that
/// the simple `og:*` matcher above doesn't cover.
fn meta_property(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
#[cfg(test)]
mod tests {
use super::*;
use serde_json::json;
#[test]
fn matches_any_http_url_with_host() {
assert!(matches("https://www.allbirds.com/products/tree-runner"));
assert!(matches(
"https://www.warbyparker.com/eyeglasses/women/percey/jet-black-with-polished-gold"
));
assert!(matches("https://example.com/p/widget"));
assert!(matches("http://shop.example.com/foo/bar"));
}
#[test]
fn rejects_empty_or_non_http() {
assert!(!matches(""));
assert!(!matches("not-a-url"));
assert!(!matches("ftp://example.com/file"));
}
#[test]
fn find_product_walks_graph() {
let block = json!({
"@context": "https://schema.org",
"@graph": [
{"@type": "Organization", "name": "ACME"},
{"@type": "Product", "name": "Widget", "sku": "ABC"}
]
});
let blocks = vec![block];
let p = find_product(&blocks).unwrap();
assert_eq!(p.get("name").and_then(|v| v.as_str()), Some("Widget"));
}
#[test]
fn find_product_handles_array_type() {
let block = json!({
"@type": ["Product", "Clothing"],
"name": "Tee"
});
assert!(is_product_type(&block));
}
#[test]
fn get_brand_from_string_or_object() {
assert_eq!(get_brand(&json!({"brand": "ACME"})), Some("ACME".into()));
assert_eq!(
get_brand(&json!({"brand": {"@type": "Brand", "name": "ACME"}})),
Some("ACME".into())
);
}
#[test]
fn collect_offers_handles_single_and_aggregate() {
let p = json!({
"offers": {
"@type": "Offer",
"price": "19.99",
"priceCurrency": "USD",
"availability": "https://schema.org/InStock"
}
});
let offers = collect_offers(&p);
assert_eq!(offers.len(), 1);
assert_eq!(
offers[0].get("price").and_then(|v| v.as_str()),
Some("19.99")
);
assert_eq!(
offers[0].get("availability").and_then(|v| v.as_str()),
Some("InStock")
);
}
// --- OG fallback --------------------------------------------------------
#[test]
fn has_og_product_signal_accepts_product_type_or_price() {
let type_only = r#"<meta property="og:type" content="product">"#;
let price_only = r#"<meta property="product:price:amount" content="49.00">"#;
let neither = r#"<meta property="og:title" content="My Article"><meta property="og:type" content="article">"#;
assert!(has_og_product_signal(type_only));
assert!(has_og_product_signal(price_only));
assert!(!has_og_product_signal(neither));
}
#[test]
fn og_fallback_builds_payload_without_jsonld() {
let html = r##"<html><head>
<meta property="og:type" content="product">
<meta property="og:title" content="Handmade Candle">
<meta property="og:image" content="https://cdn.example.com/candle.jpg">
<meta property="og:description" content="Small-batch soy candle.">
<meta property="product:price:amount" content="18.00">
<meta property="product:price:currency" content="USD">
<meta property="product:availability" content="in stock">
<meta property="product:brand" content="Little Studio">
</head></html>"##;
let v = parse(html, "https://example.com/p/candle").unwrap();
assert_eq!(v["data_source"], "og_fallback");
assert_eq!(v["name"], "Handmade Candle");
assert_eq!(v["description"], "Small-batch soy candle.");
assert_eq!(v["brand"], "Little Studio");
assert_eq!(v["offers"][0]["price"], "18.00");
assert_eq!(v["offers"][0]["currency"], "USD");
assert_eq!(v["offers"][0]["availability"], "in stock");
assert_eq!(v["images"][0], "https://cdn.example.com/candle.jpg");
}
#[test]
fn jsonld_augments_empty_offers_with_og_price() {
// Patagonia-shaped page: Product JSON-LD without an Offer, plus
// product:price:* OG tags. We should merge.
let html = r##"<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Better Sweater","brand":"Patagonia",
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.4","reviewCount":"1142"}}
</script>
<meta property="product:price:amount" content="139.00">
<meta property="product:price:currency" content="USD">
</head></html>"##;
let v = parse(html, "https://patagonia.com/p/x").unwrap();
assert_eq!(v["data_source"], "jsonld+og");
assert_eq!(v["name"], "Better Sweater");
assert_eq!(v["offers"].as_array().unwrap().len(), 1);
assert_eq!(v["offers"][0]["price"], "139.00");
}
#[test]
fn jsonld_only_stays_pure_jsonld() {
let html = r##"<html><head>
<script type="application/ld+json">
{"@type":"Product","name":"Widget",
"offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
</script>
</head></html>"##;
let v = parse(html, "https://example.com/p/w").unwrap();
assert_eq!(v["data_source"], "jsonld");
assert_eq!(v["offers"][0]["price"], "9.99");
}
#[test]
fn parse_returns_none_on_no_product_signals() {
let html = r#"<html><head>
<meta property="og:title" content="My Blog Post">
<meta property="og:type" content="article">
</head></html>"#;
assert!(parse(html, "https://blog.example.com/post").is_none());
}
}


@@ -0,0 +1,572 @@
//! Etsy listing extractor.
//!
//! Etsy product pages at `etsy.com/listing/{id}` (and a sluggy variant
//! `etsy.com/listing/{id}/{slug}`) ship a Schema.org `Product` JSON-LD
//! block with title, price, currency, availability, shop seller, and
//! an `AggregateRating` for the listing.
//!
//! Etsy puts Cloudflare + custom WAF in front of product pages with a
//! high variance: the Firefox profile gets clean HTML most of the time
//! but some listings return a CF interstitial. We route through
//! `cloud::smart_fetch_html` so both paths resolve to the same parser,
//! same as `ebay_listing`.
//!
//! ## URL slug as last-resort title
//!
//! Even with cloud antibot bypass, Etsy frequently serves a generic
//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
//! empty markdown). In that case we humanise the slug from the URL
//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
//! "Personalized Stainless Steel Tumbler") so callers always get a
//! meaningful title. Degrades gracefully when the URL has no slug.
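The slug humanisation described above can be sketched as a std-only free function (hypothetical `humanise_demo` name; the real `humanise_slug` lives further down this file and may differ in detail):

```rust
// Hedged sketch of slug humanisation: split on '-', uppercase each
// word's first character, and re-join with spaces.
fn humanise_demo(slug: &str) -> String {
    slug.split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            let mut c = w.chars();
            match c.next() {
                Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
                None => String::new(),
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    assert_eq!(
        humanise_demo("personalized-stainless-steel-tumbler"),
        "Personalized Stainless Steel Tumbler"
    );
    println!("ok");
}
```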
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "etsy_listing",
label: "Etsy listing",
description: "Returns listing title, price, currency, availability, shop, rating, and image. Heavy listings may need WEBCLAW_API_KEY for antibot.",
url_patterns: &[
"https://www.etsy.com/listing/{id}",
"https://www.etsy.com/listing/{id}/{slug}",
"https://www.etsy.com/{locale}/listing/{id}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !is_etsy_host(host) {
return false;
}
parse_listing_id(url).is_some()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let listing_id = parse_listing_id(url)
.ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?;
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
let mut data = parse(&fetched.html, url, &listing_id);
if let Some(obj) = data.as_object_mut() {
obj.insert(
"data_source".into(),
match fetched.source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
}
Ok(data)
}
pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
let jsonld = find_product_jsonld(html);
let slug_title = humanise_slug(parse_slug(url).as_deref());
let title = jsonld
.as_ref()
.and_then(|v| get_text(v, "name"))
.or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
.or(slug_title);
let description = jsonld
.as_ref()
.and_then(|v| get_text(v, "description"))
.or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
let image = jsonld
.as_ref()
.and_then(get_first_image)
.or_else(|| og(html, "image"));
let brand = jsonld.as_ref().and_then(get_brand);
// Etsy listings often ship either a single Offer or an
// AggregateOffer when the listing has variants with different prices.
let offer = jsonld.as_ref().and_then(first_offer);
let (low_price, high_price, single_price) = match offer.as_ref() {
Some(o) => (
get_text(o, "lowPrice"),
get_text(o, "highPrice"),
get_text(o, "price"),
),
None => (None, None, None),
};
let currency = offer.as_ref().and_then(|o| get_text(o, "priceCurrency"));
let availability = offer
.as_ref()
.and_then(|o| get_text(o, "availability").map(strip_schema_prefix));
let item_condition = jsonld
.as_ref()
.and_then(|v| get_text(v, "itemCondition"))
.map(strip_schema_prefix);
// Shop name: offers[0].seller.name on newer listings, top-level
// `brand` on older listings (Etsy changed the schema around 2022).
// Fall back through both so either shape resolves.
let shop = offer
.as_ref()
.and_then(|o| {
o.get("seller")
.and_then(|s| s.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
})
.or_else(|| brand.clone());
let shop_url = shop_url_from_html(html);
let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
json!({
"url": url,
"listing_id": listing_id,
"title": title,
"description": description,
"image": image,
"brand": brand,
"price": single_price,
"low_price": low_price,
"high_price": high_price,
"currency": currency,
"availability": availability,
"item_condition": item_condition,
"shop": shop,
"shop_url": shop_url,
"aggregate_rating": aggregate_rating,
})
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn is_etsy_host(host: &str) -> bool {
host == "etsy.com" || host == "www.etsy.com" || host.ends_with(".etsy.com")
}
/// Extract the numeric listing id. Etsy ids are 9-11 digits today, but
/// we accept any run of six or more digits right after `/listing/`.
///
/// Handles `/listing/{id}`, `/listing/{id}/{slug}`, and the localised
/// `/{locale}/listing/{id}` shape (e.g. `/fr/listing/...`).
fn parse_listing_id(url: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r"/listing/(\d{6,})(?:[/?#]|$)").unwrap());
re.captures(url)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
/// Extract the URL slug after the listing id, e.g.
/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
/// is the bare `/listing/{id}` shape.
fn parse_slug(url: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
re.captures(url)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
/// Turn a URL slug into a human-ish title:
/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
/// Steel Tumbler`. Capitalises each dash- or underscore-separated
/// token. Returns `None` on empty input.
fn humanise_slug(slug: Option<&str>) -> Option<String> {
let raw = slug?.trim();
if raw.is_empty() {
return None;
}
let words: Vec<String> = raw
.split(['-', '_'])
.filter(|w| !w.is_empty())
.map(capitalise_word)
.collect();
if words.is_empty() {
None
} else {
Some(words.join(" "))
}
}
fn capitalise_word(w: &str) -> String {
let mut chars = w.chars();
match chars.next() {
Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
None => String::new(),
}
}
/// True when the OG title is Etsy's fallback-page title rather than a
/// listing-specific title. Expired / region-blocked / antibot-filtered
/// pages return Etsy's sitewide tagline:
/// `"Etsy - Your place to buy and sell all things handmade..."`, or
/// simply `"etsy.com"`. A real listing title always starts with the
/// item name, never with "Etsy - " or the domain.
fn is_generic_title(t: &str) -> bool {
let normalised = t.trim().to_lowercase();
if matches!(
normalised.as_str(),
"etsy.com" | "etsy" | "www.etsy.com" | ""
) {
return true;
}
// Etsy's sitewide marketing tagline, served on 404 / blocked pages.
if normalised.starts_with("etsy - ")
|| normalised.starts_with("etsy.com - ")
|| normalised.starts_with("etsy uk - ")
{
return true;
}
// Etsy's "item unavailable" placeholder, served on delisted
// products. Keep the slug fallback so callers still see what the
// URL was about.
normalised.starts_with("this item is unavailable")
|| normalised.starts_with("sorry, this item is")
|| normalised == "item not available - etsy"
}
/// True when the OG description is an Etsy error-page placeholder or
/// sitewide marketing blurb rather than a real listing description.
fn is_generic_description(d: &str) -> bool {
let normalised = d.trim().to_lowercase();
if normalised.is_empty() {
return true;
}
normalised.starts_with("sorry, the page you were looking for")
|| normalised.starts_with("page not found")
|| normalised.starts_with("find the perfect handmade gift")
}
// ---------------------------------------------------------------------------
// JSON-LD walkers (same shape as ebay_listing; kept separate so the two
// extractors can diverge without cross-impact)
// ---------------------------------------------------------------------------
fn find_product_jsonld(html: &str) -> Option<Value> {
let blocks = webclaw_core::structured_data::extract_json_ld(html);
for b in blocks {
if let Some(found) = find_product_in(&b) {
return Some(found);
}
}
None
}
fn find_product_in(v: &Value) -> Option<Value> {
if is_product_type(v) {
return Some(v.clone());
}
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_product_in(item) {
return Some(found);
}
}
}
None
}
fn is_product_type(v: &Value) -> bool {
let Some(t) = v.get("@type") else {
return false;
};
let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
match t {
Value::String(s) => is_prod(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn get_brand(v: &Value) -> Option<String> {
let brand = v.get("brand")?;
if let Some(s) = brand.as_str() {
return Some(s.to_string());
}
brand
.as_object()
.and_then(|o| o.get("name"))
.and_then(|n| n.as_str())
.map(String::from)
}
fn get_first_image(v: &Value) -> Option<String> {
match v.get("image")? {
Value::String(s) => Some(s.clone()),
Value::Array(arr) => arr.iter().find_map(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}),
Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}
}
fn first_offer(v: &Value) -> Option<Value> {
let offers = v.get("offers")?;
match offers {
Value::Array(arr) => arr.first().cloned(),
Value::Object(_) => Some(offers.clone()),
_ => None,
}
}
fn get_aggregate_rating(v: &Value) -> Option<Value> {
let r = v.get("aggregateRating")?;
Some(json!({
"rating_value": get_text(r, "ratingValue"),
"review_count": get_text(r, "reviewCount"),
"best_rating": get_text(r, "bestRating"),
}))
}
fn strip_schema_prefix(s: String) -> String {
s.replace("http://schema.org/", "")
.replace("https://schema.org/", "")
}
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
/// Etsy links the owning shop with a canonical anchor like
/// `<a href="/shop/ShopName" ...>`. Grab the first such anchor in the
/// page.
fn shop_url_from_html(html: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r#"href="(/shop/[A-Za-z0-9_-]+)""#).unwrap());
re.captures(html)
.and_then(|c| c.get(1))
.map(|m| format!("https://www.etsy.com{}", m.as_str()))
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_etsy_listing_urls() {
assert!(matches("https://www.etsy.com/listing/123456789"));
assert!(matches(
"https://www.etsy.com/listing/123456789/vintage-typewriter"
));
assert!(matches(
"https://www.etsy.com/fr/listing/123456789/vintage-typewriter"
));
assert!(!matches("https://www.etsy.com/"));
assert!(!matches("https://www.etsy.com/shop/SomeShop"));
assert!(!matches("https://example.com/listing/123456789"));
}
#[test]
fn parse_listing_id_handles_slug_and_locale() {
assert_eq!(
parse_listing_id("https://www.etsy.com/listing/123456789"),
Some("123456789".into())
);
assert_eq!(
parse_listing_id("https://www.etsy.com/listing/123456789/slug-here"),
Some("123456789".into())
);
assert_eq!(
parse_listing_id("https://www.etsy.com/fr/listing/123456789/slug"),
Some("123456789".into())
);
assert_eq!(
parse_listing_id("https://www.etsy.com/listing/123456789?ref=foo"),
Some("123456789".into())
);
}
#[test]
fn parse_extracts_from_fixture_jsonld() {
let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Handmade Ceramic Mug","sku":"MUG-001",
"brand":{"@type":"Brand","name":"Studio Clay"},
"image":["https://i.etsystatic.com/abc.jpg","https://i.etsystatic.com/xyz.jpg"],
"itemCondition":"https://schema.org/NewCondition",
"offers":{"@type":"Offer","price":"24.00","priceCurrency":"USD",
"availability":"https://schema.org/InStock",
"seller":{"@type":"Organization","name":"StudioClay"}},
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.9","reviewCount":"127","bestRating":"5"}}
</script>
<a href="/shop/StudioClay" class="wt-text-link">StudioClay</a>
</head></html>"##;
let v = parse(html, "https://www.etsy.com/listing/1", "1");
assert_eq!(v["title"], "Handmade Ceramic Mug");
assert_eq!(v["price"], "24.00");
assert_eq!(v["currency"], "USD");
assert_eq!(v["availability"], "InStock");
assert_eq!(v["item_condition"], "NewCondition");
assert_eq!(v["shop"], "StudioClay");
assert_eq!(v["shop_url"], "https://www.etsy.com/shop/StudioClay");
assert_eq!(v["brand"], "Studio Clay");
assert_eq!(v["aggregate_rating"]["rating_value"], "4.9");
assert_eq!(v["aggregate_rating"]["review_count"], "127");
}
#[test]
fn parse_handles_aggregate_offer_price_range() {
let html = r##"
<script type="application/ld+json">
{"@type":"Product","name":"Mug Set",
"offers":{"@type":"AggregateOffer",
"lowPrice":"18.00","highPrice":"36.00","priceCurrency":"USD"}}
</script>
"##;
let v = parse(html, "https://www.etsy.com/listing/2", "2");
assert_eq!(v["low_price"], "18.00");
assert_eq!(v["high_price"], "36.00");
assert_eq!(v["currency"], "USD");
}
#[test]
fn parse_falls_back_to_og_when_no_jsonld() {
let html = r#"
<html><head>
<meta property="og:title" content="Minimal Fallback Item">
<meta property="og:description" content="OG-only extraction test.">
<meta property="og:image" content="https://i.etsystatic.com/fallback.jpg">
</head></html>"#;
let v = parse(html, "https://www.etsy.com/listing/3", "3");
assert_eq!(v["title"], "Minimal Fallback Item");
assert_eq!(v["description"], "OG-only extraction test.");
assert_eq!(v["image"], "https://i.etsystatic.com/fallback.jpg");
// No price fields when we only have OG.
assert!(v["price"].is_null());
}
#[test]
fn parse_slug_from_url() {
assert_eq!(
parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
Some("vintage-typewriter".into())
);
assert_eq!(
parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
Some("slug".into())
);
assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
assert_eq!(
parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
Some("slug".into())
);
}
#[test]
fn humanise_slug_capitalises_each_word() {
assert_eq!(
humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
Some("Personalized Stainless Steel Tumbler")
);
assert_eq!(
humanise_slug(Some("hand_crafted_mug")).as_deref(),
Some("Hand Crafted Mug")
);
assert_eq!(humanise_slug(Some("")), None);
assert_eq!(humanise_slug(None), None);
}
#[test]
fn is_generic_title_catches_common_shapes() {
assert!(is_generic_title("etsy.com"));
assert!(is_generic_title("Etsy"));
assert!(is_generic_title(" etsy.com "));
assert!(is_generic_title(
"Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
));
assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
assert!(!is_generic_title("Vintage Typewriter"));
assert!(!is_generic_title("Handmade Etsy-style Mug"));
}
#[test]
fn is_generic_description_catches_404_shapes() {
assert!(is_generic_description(""));
assert!(is_generic_description(
"Sorry, the page you were looking for was not found."
));
assert!(is_generic_description("Page not found"));
assert!(!is_generic_description(
"Hand-thrown ceramic mug, dishwasher safe."
));
}
#[test]
fn parse_uses_slug_when_og_is_generic() {
// Cloud-blocked Etsy listing: og:title is a site-wide generic
// placeholder, no JSON-LD, no description. Slug should win.
let html = r#"<html><head>
<meta property="og:title" content="etsy.com">
</head></html>"#;
let v = parse(
html,
"https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
"1079113183",
);
assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
}
#[test]
fn parse_prefers_real_og_over_slug() {
let html = r#"<html><head>
<meta property="og:title" content="Real Listing Title">
</head></html>"#;
let v = parse(
html,
"https://www.etsy.com/listing/1079113183/the-url-slug",
"1079113183",
);
assert_eq!(v["title"], "Real Listing Title");
}
}

@@ -0,0 +1,172 @@
//! GitHub issue structured extractor.
//!
//! Mirror of `github_pr` but on `/issues/{number}`. Uses
//! `api.github.com/repos/{owner}/{repo}/issues/{number}`. Returns the
//! issue body + comment count + labels + milestone + author /
//! assignees. Full per-comment bodies would need another call; left
//! for a follow-up.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "github_issue",
label: "GitHub issue",
description: "Returns issue metadata: title, body, state, author, labels, assignees, milestone, comment count.",
url_patterns: &["https://github.com/{owner}/{repo}/issues/{number}"],
};
pub fn matches(url: &str) -> bool {
let host = url
.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("");
if host != "github.com" && host != "www.github.com" {
return false;
}
parse_issue(url).is_some()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (owner, repo, number) = parse_issue(url).ok_or_else(|| {
FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'"))
})?;
let api_url = format!("https://api.github.com/repos/{owner}/{repo}/issues/{number}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"github_issue: issue '{owner}/{repo}#{number}' not found"
)));
}
if resp.status == 403 {
return Err(FetchError::Build(
"github_issue: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"github api returned status {}",
resp.status
)));
}
let issue: Issue = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("github issue parse: {e}")))?;
// The same endpoint returns PRs too; reject if we got one so the caller
// uses /v1/scrape/github_pr instead of getting a half-shaped payload.
if issue.pull_request.is_some() {
return Err(FetchError::Build(format!(
"github_issue: '{owner}/{repo}#{number}' is a pull request, use /v1/scrape/github_pr"
)));
}
Ok(json!({
"url": url,
"owner": owner,
"repo": repo,
"number": issue.number,
"title": issue.title,
"body": issue.body,
"state": issue.state,
"state_reason": issue.state_reason,
"author": issue.user.as_ref().and_then(|u| u.login.clone()),
"labels": issue.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
"assignees": issue.assignees.iter().filter_map(|u| u.login.clone()).collect::<Vec<_>>(),
"milestone": issue.milestone.as_ref().and_then(|m| m.title.clone()),
"comments": issue.comments,
"locked": issue.locked,
"created_at": issue.created_at,
"updated_at": issue.updated_at,
"closed_at": issue.closed_at,
"html_url": issue.html_url,
}))
}
fn parse_issue(url: &str) -> Option<(String, String, u64)> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
if segs.len() < 4 || segs[2] != "issues" {
return None;
}
let number: u64 = segs[3].parse().ok()?;
Some((segs[0].to_string(), segs[1].to_string(), number))
}
// ---------------------------------------------------------------------------
// GitHub issue API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Issue {
number: Option<i64>,
title: Option<String>,
body: Option<String>,
state: Option<String>,
state_reason: Option<String>,
locked: Option<bool>,
comments: Option<i64>,
created_at: Option<String>,
updated_at: Option<String>,
closed_at: Option<String>,
html_url: Option<String>,
user: Option<UserRef>,
#[serde(default)]
labels: Vec<LabelRef>,
#[serde(default)]
assignees: Vec<UserRef>,
milestone: Option<Milestone>,
/// Present when this "issue" is actually a pull request. The REST
/// API overloads the issues endpoint for PRs.
pull_request: Option<serde_json::Value>,
}
#[derive(Deserialize)]
struct UserRef {
login: Option<String>,
}
#[derive(Deserialize)]
struct LabelRef {
name: Option<String>,
}
#[derive(Deserialize)]
struct Milestone {
title: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_issue_urls() {
assert!(matches("https://github.com/rust-lang/rust/issues/100"));
assert!(matches("https://github.com/rust-lang/rust/issues/100/"));
assert!(!matches("https://github.com/rust-lang/rust"));
assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
assert!(!matches("https://github.com/rust-lang/rust/issues"));
}
#[test]
fn parse_issue_extracts_owner_repo_number() {
assert_eq!(
parse_issue("https://github.com/rust-lang/rust/issues/100"),
Some(("rust-lang".into(), "rust".into(), 100))
);
assert_eq!(
parse_issue("https://github.com/rust-lang/rust/issues/100/?foo=bar"),
Some(("rust-lang".into(), "rust".into(), 100))
);
}
}

@@ -0,0 +1,189 @@
//! GitHub pull request structured extractor.
//!
//! Uses `api.github.com/repos/{owner}/{repo}/pulls/{number}`. Returns
//! the PR metadata + a counted summary of comments and review activity.
//! Full diff and per-comment bodies require additional calls — left for
//! a follow-up enhancement so the v1 stays one network round-trip.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "github_pr",
label: "GitHub pull request",
description: "Returns PR metadata: title, body, state, author, labels, additions/deletions, file count.",
url_patterns: &["https://github.com/{owner}/{repo}/pull/{number}"],
};
pub fn matches(url: &str) -> bool {
let host = url
.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("");
if host != "github.com" && host != "www.github.com" {
return false;
}
parse_pr(url).is_some()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (owner, repo, number) = parse_pr(url).ok_or_else(|| {
FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'"))
})?;
let api_url = format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"github_pr: pull request '{owner}/{repo}#{number}' not found"
)));
}
if resp.status == 403 {
return Err(FetchError::Build(
"github_pr: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"github api returned status {}",
resp.status
)));
}
let p: PullRequest = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("github pr parse: {e}")))?;
Ok(json!({
"url": url,
"owner": owner,
"repo": repo,
"number": p.number,
"title": p.title,
"body": p.body,
"state": p.state,
"draft": p.draft,
"merged": p.merged,
"merged_at": p.merged_at,
"merge_commit_sha": p.merge_commit_sha,
"author": p.user.as_ref().and_then(|u| u.login.clone()),
"labels": p.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
"milestone": p.milestone.as_ref().and_then(|m| m.title.clone()),
"head_ref": p.head.as_ref().and_then(|r| r.ref_name.clone()),
"base_ref": p.base.as_ref().and_then(|r| r.ref_name.clone()),
"head_sha": p.head.as_ref().and_then(|r| r.sha.clone()),
"additions": p.additions,
"deletions": p.deletions,
"changed_files": p.changed_files,
"commits": p.commits,
"comments": p.comments,
"review_comments": p.review_comments,
"created_at": p.created_at,
"updated_at": p.updated_at,
"closed_at": p.closed_at,
"html_url": p.html_url,
}))
}
fn parse_pr(url: &str) -> Option<(String, String, u64)> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
// /{owner}/{repo}/pull/{number} (or /pulls/{number} variant)
if segs.len() < 4 {
return None;
}
if segs[2] != "pull" && segs[2] != "pulls" {
return None;
}
let number: u64 = segs[3].parse().ok()?;
Some((segs[0].to_string(), segs[1].to_string(), number))
}
// ---------------------------------------------------------------------------
// GitHub PR API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct PullRequest {
number: Option<i64>,
title: Option<String>,
body: Option<String>,
state: Option<String>,
draft: Option<bool>,
merged: Option<bool>,
merged_at: Option<String>,
merge_commit_sha: Option<String>,
user: Option<UserRef>,
#[serde(default)]
labels: Vec<LabelRef>,
milestone: Option<Milestone>,
head: Option<GitRef>,
base: Option<GitRef>,
additions: Option<i64>,
deletions: Option<i64>,
changed_files: Option<i64>,
commits: Option<i64>,
comments: Option<i64>,
review_comments: Option<i64>,
created_at: Option<String>,
updated_at: Option<String>,
closed_at: Option<String>,
html_url: Option<String>,
}
#[derive(Deserialize)]
struct UserRef {
login: Option<String>,
}
#[derive(Deserialize)]
struct LabelRef {
name: Option<String>,
}
#[derive(Deserialize)]
struct Milestone {
title: Option<String>,
}
#[derive(Deserialize)]
struct GitRef {
#[serde(rename = "ref")]
ref_name: Option<String>,
sha: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_pr_urls() {
assert!(matches("https://github.com/rust-lang/rust/pull/12345"));
assert!(matches(
"https://github.com/rust-lang/rust/pull/12345/files"
));
assert!(!matches("https://github.com/rust-lang/rust"));
assert!(!matches("https://github.com/rust-lang/rust/issues/100"));
assert!(!matches("https://github.com/rust-lang"));
}
#[test]
fn parse_pr_extracts_owner_repo_number() {
assert_eq!(
parse_pr("https://github.com/rust-lang/rust/pull/12345"),
Some(("rust-lang".into(), "rust".into(), 12345))
);
assert_eq!(
parse_pr("https://github.com/rust-lang/rust/pull/12345/files"),
Some(("rust-lang".into(), "rust".into(), 12345))
);
}
}

@@ -0,0 +1,179 @@
//! GitHub release structured extractor.
//!
//! `api.github.com/repos/{owner}/{repo}/releases/tags/{tag}`. Returns
//! the release notes body, asset list with download counts, and
//! prerelease flag.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "github_release",
label: "GitHub release",
description: "Returns release metadata: tag, name, body (release notes), assets with download counts.",
url_patterns: &["https://github.com/{owner}/{repo}/releases/tag/{tag}"],
};
pub fn matches(url: &str) -> bool {
let host = url
.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("");
if host != "github.com" && host != "www.github.com" {
return false;
}
parse_release(url).is_some()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (owner, repo, tag) = parse_release(url).ok_or_else(|| {
FetchError::Build(format!("github_release: cannot parse release URL '{url}'"))
})?;
let api_url = format!("https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"github_release: release '{owner}/{repo}@{tag}' not found"
)));
}
if resp.status == 403 {
return Err(FetchError::Build(
"github_release: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour."
.into(),
));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"github api returned status {}",
resp.status
)));
}
let r: Release = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("github release parse: {e}")))?;
let assets: Vec<Value> = r
.assets
.iter()
.map(|a| {
json!({
"name": a.name,
"size": a.size,
"download_count": a.download_count,
"browser_download_url": a.browser_download_url,
"content_type": a.content_type,
"created_at": a.created_at,
"updated_at": a.updated_at,
})
})
.collect();
Ok(json!({
"url": url,
"owner": owner,
"repo": repo,
"tag_name": r.tag_name,
"name": r.name,
"body": r.body,
"draft": r.draft,
"prerelease": r.prerelease,
"author": r.author.as_ref().and_then(|u| u.login.clone()),
"created_at": r.created_at,
"published_at": r.published_at,
"asset_count": assets.len(),
"total_downloads": r.assets.iter().map(|a| a.download_count.unwrap_or(0)).sum::<i64>(),
"assets": assets,
"html_url": r.html_url,
}))
}
fn parse_release(url: &str) -> Option<(String, String, String)> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
// /{owner}/{repo}/releases/tag/{tag}
if segs.len() < 5 {
return None;
}
if segs[2] != "releases" || segs[3] != "tag" {
return None;
}
Some((
segs[0].to_string(),
segs[1].to_string(),
segs[4].to_string(),
))
}
// ---------------------------------------------------------------------------
// GitHub Release API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Release {
tag_name: Option<String>,
name: Option<String>,
body: Option<String>,
draft: Option<bool>,
prerelease: Option<bool>,
author: Option<UserRef>,
created_at: Option<String>,
published_at: Option<String>,
html_url: Option<String>,
#[serde(default)]
assets: Vec<Asset>,
}
#[derive(Deserialize)]
struct UserRef {
login: Option<String>,
}
#[derive(Deserialize)]
struct Asset {
name: Option<String>,
size: Option<i64>,
download_count: Option<i64>,
browser_download_url: Option<String>,
content_type: Option<String>,
created_at: Option<String>,
updated_at: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_release_urls() {
assert!(matches(
"https://github.com/rust-lang/rust/releases/tag/1.85.0"
));
assert!(matches(
"https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"
));
assert!(!matches("https://github.com/rust-lang/rust"));
assert!(!matches("https://github.com/rust-lang/rust/releases"));
assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
}
#[test]
fn parse_release_extracts_owner_repo_tag() {
assert_eq!(
parse_release("https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"),
Some(("0xMassi".into(), "webclaw".into(), "v0.4.0".into()))
);
assert_eq!(
parse_release("https://github.com/rust-lang/rust/releases/tag/1.85.0/?foo=bar"),
Some(("rust-lang".into(), "rust".into(), "1.85.0".into()))
);
}
}

@@ -0,0 +1,212 @@
//! GitHub repository structured extractor.
//!
//! Uses GitHub's public REST API at `api.github.com/repos/{owner}/{repo}`.
//! Unauthenticated requests get 60/hour per IP, which is fine for users
//! self-hosting and for low-volume cloud usage. Production cloud should
//! set a `GITHUB_TOKEN` to lift to 5,000/hour, but the extractor doesn't
//! depend on it being set — it works open out of the box.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "github_repo",
label: "GitHub repository",
description: "Returns repo metadata: stars, forks, topics, license, default branch, recent activity.",
url_patterns: &["https://github.com/{owner}/{repo}"],
};
pub fn matches(url: &str) -> bool {
let host = url
.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("");
if host != "github.com" && host != "www.github.com" {
return false;
}
// Path must be exactly /{owner}/{repo} (or with trailing slash). Reject
// sub-pages (issues, pulls, blob, etc.) so we don't claim URLs that
// the github_issue / github_pr extractors handle.
let path = url
.split("://")
.nth(1)
.and_then(|s| s.split_once('/'))
.map(|(_, p)| p)
.unwrap_or("");
let stripped = path
.split(['?', '#'])
.next()
.unwrap_or("")
.trim_end_matches('/');
let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
segs.len() == 2 && !RESERVED_OWNERS.contains(&segs[0])
}
/// GitHub uses some top-level paths for non-repo pages.
const RESERVED_OWNERS: &[&str] = &[
"settings",
"marketplace",
"explore",
"topics",
"trending",
"collections",
"events",
"sponsors",
"issues",
"pulls",
"notifications",
"new",
"organizations",
"login",
"join",
"search",
"about",
];
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (owner, repo) = parse_owner_repo(url).ok_or_else(|| {
FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'"))
})?;
let api_url = format!("https://api.github.com/repos/{owner}/{repo}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"github_repo: repo '{owner}/{repo}' not found"
)));
}
if resp.status == 403 {
return Err(FetchError::Build(
"github_repo: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"github api returned status {}",
resp.status
)));
}
let r: Repo = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("github api parse: {e}")))?;
Ok(json!({
"url": url,
"owner": r.owner.as_ref().map(|o| &o.login),
"name": r.name,
"full_name": r.full_name,
"description": r.description,
"homepage": r.homepage,
"language": r.language,
"topics": r.topics,
"license": r.license.as_ref().and_then(|l| l.spdx_id.clone()),
"license_name": r.license.as_ref().map(|l| l.name.clone()),
"default_branch": r.default_branch,
"stars": r.stargazers_count,
"forks": r.forks_count,
"watchers": r.subscribers_count,
"open_issues": r.open_issues_count,
"size_kb": r.size,
"archived": r.archived,
"fork": r.fork,
"is_template": r.is_template,
"has_issues": r.has_issues,
"has_wiki": r.has_wiki,
"has_pages": r.has_pages,
"has_discussions": r.has_discussions,
"created_at": r.created_at,
"updated_at": r.updated_at,
"pushed_at": r.pushed_at,
"html_url": r.html_url,
}))
}
fn parse_owner_repo(url: &str) -> Option<(String, String)> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
let owner = segs.next()?.to_string();
let repo = segs.next()?.to_string();
Some((owner, repo))
}
// ---------------------------------------------------------------------------
// GitHub API types — only the fields we surface
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Repo {
name: Option<String>,
full_name: Option<String>,
description: Option<String>,
homepage: Option<String>,
language: Option<String>,
#[serde(default)]
topics: Vec<String>,
license: Option<License>,
default_branch: Option<String>,
stargazers_count: Option<i64>,
forks_count: Option<i64>,
subscribers_count: Option<i64>,
open_issues_count: Option<i64>,
size: Option<i64>,
archived: Option<bool>,
fork: Option<bool>,
is_template: Option<bool>,
has_issues: Option<bool>,
has_wiki: Option<bool>,
has_pages: Option<bool>,
has_discussions: Option<bool>,
created_at: Option<String>,
updated_at: Option<String>,
pushed_at: Option<String>,
html_url: Option<String>,
owner: Option<Owner>,
}
#[derive(Deserialize)]
struct Owner {
login: String,
}
#[derive(Deserialize)]
struct License {
name: String,
spdx_id: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_repo_root_only() {
assert!(matches("https://github.com/rust-lang/rust"));
assert!(matches("https://github.com/rust-lang/rust/"));
assert!(!matches("https://github.com/rust-lang/rust/issues"));
assert!(!matches("https://github.com/rust-lang/rust/pulls/123"));
assert!(!matches("https://github.com/rust-lang"));
assert!(!matches("https://github.com/marketplace"));
assert!(!matches("https://github.com/topics/rust"));
assert!(!matches("https://example.com/foo/bar"));
}
#[test]
fn parse_owner_repo_handles_trailing_slash_and_query() {
assert_eq!(
parse_owner_repo("https://github.com/rust-lang/rust"),
Some(("rust-lang".into(), "rust".into()))
);
assert_eq!(
parse_owner_repo("https://github.com/rust-lang/rust/?tab=foo"),
Some(("rust-lang".into(), "rust".into()))
);
}
}
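The owner/repo parsing above is easy to sanity-check in isolation. A standalone sketch — the helper is restated verbatim so it runs on its own; the URLs are illustrative:

```rust
// Restated from the extractor above: take the first two path segments,
// after dropping the scheme, query string, fragment, and trailing slash.
fn parse_owner_repo(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let owner = segs.next()?.to_string();
    let repo = segs.next()?.to_string();
    Some((owner, repo))
}

fn main() {
    // Query strings and fragments are dropped before segmenting.
    assert_eq!(
        parse_owner_repo("https://github.com/rust-lang/rust?tab=readme#usage"),
        Some(("rust-lang".into(), "rust".into()))
    );
    // Extra path segments are ignored; only the first two are taken.
    assert_eq!(
        parse_owner_repo("https://github.com/rust-lang/rust/issues"),
        Some(("rust-lang".into(), "rust".into()))
    );
    println!("ok");
}
```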


@@ -0,0 +1,186 @@
//! Hacker News structured extractor.
//!
//! Uses Algolia's HN API (`hn.algolia.com/api/v1/items/{id}`) which
//! returns the full post + recursive comment tree in a single request.
//! The official Firebase API at `hacker-news.firebaseio.com` needs one
//! request per item (N+1 fetches for a whole thread), so any non-trivial
//! thread would hit either a timeout or a rate limit.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "hackernews",
label: "Hacker News story",
description: "Returns post + nested comment tree for a Hacker News item.",
url_patterns: &[
"https://news.ycombinator.com/item?id=N",
"https://hn.algolia.com/items/N",
],
};
pub fn matches(url: &str) -> bool {
let host = url
.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("");
if host == "news.ycombinator.com" {
return url.contains("item?id=") || url.contains("item%3Fid=");
}
if host == "hn.algolia.com" {
return url.contains("/items/");
}
false
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let id = parse_item_id(url).ok_or_else(|| {
FetchError::Build(format!("hackernews: cannot parse item id from '{url}'"))
})?;
let api_url = format!("https://hn.algolia.com/api/v1/items/{id}");
let resp = client.fetch(&api_url).await?;
if resp.status != 200 {
return Err(FetchError::Build(format!(
"hn algolia returned status {}",
resp.status
)));
}
let item: AlgoliaItem = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("hn algolia parse: {e}")))?;
let post = post_json(&item);
let comments: Vec<Value> = item.children.iter().filter_map(comment_json).collect();
Ok(json!({
"url": url,
"post": post,
"comments": comments,
}))
}
// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------
/// Pull the numeric id out of an HN URL. Handles `item?id=N` and the
/// Algolia mirror's `/items/N` form.
fn parse_item_id(url: &str) -> Option<u64> {
if let Some(after) = url.split("id=").nth(1) {
let n = after.split('&').next().unwrap_or(after);
if let Ok(id) = n.parse::<u64>() {
return Some(id);
}
}
if let Some(after) = url.split("/items/").nth(1) {
let n = after.split(['/', '?', '#']).next().unwrap_or(after);
if let Ok(id) = n.parse::<u64>() {
return Some(id);
}
}
None
}
fn post_json(item: &AlgoliaItem) -> Value {
json!({
"id": item.id,
"type": item.r#type,
"title": item.title,
"url": item.url,
"author": item.author,
"points": item.points,
"text": item.text, // populated for ask/show/tell
"created_at": item.created_at,
"created_at_unix": item.created_at_i,
"comment_count": count_descendants(item),
"permalink": item.id.map(|i| format!("https://news.ycombinator.com/item?id={i}")),
})
}
fn comment_json(item: &AlgoliaItem) -> Option<Value> {
if !matches!(item.r#type.as_deref(), Some("comment")) {
return None;
}
// Dead/deleted comments still appear in the tree; surface them honestly.
let replies: Vec<Value> = item.children.iter().filter_map(comment_json).collect();
Some(json!({
"id": item.id,
"author": item.author,
"text": item.text,
"created_at": item.created_at,
"created_at_unix": item.created_at_i,
"parent_id": item.parent_id,
"story_id": item.story_id,
"replies": replies,
}))
}
fn count_descendants(item: &AlgoliaItem) -> usize {
item.children
.iter()
.filter(|c| matches!(c.r#type.as_deref(), Some("comment")))
.map(|c| 1 + count_descendants(c))
.sum()
}
// ---------------------------------------------------------------------------
// Algolia API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct AlgoliaItem {
id: Option<u64>,
r#type: Option<String>,
title: Option<String>,
url: Option<String>,
author: Option<String>,
points: Option<i64>,
text: Option<String>,
created_at: Option<String>,
created_at_i: Option<i64>,
parent_id: Option<u64>,
story_id: Option<u64>,
#[serde(default)]
children: Vec<AlgoliaItem>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_hn_item_urls() {
assert!(matches("https://news.ycombinator.com/item?id=1"));
assert!(matches("https://news.ycombinator.com/item?id=12345"));
assert!(matches("https://hn.algolia.com/items/1"));
}
#[test]
fn rejects_non_item_urls() {
assert!(!matches("https://news.ycombinator.com/"));
assert!(!matches("https://news.ycombinator.com/news"));
assert!(!matches("https://example.com/item?id=1"));
}
#[test]
fn parse_item_id_handles_both_forms() {
assert_eq!(
parse_item_id("https://news.ycombinator.com/item?id=1"),
Some(1)
);
assert_eq!(
parse_item_id("https://news.ycombinator.com/item?id=12345&p=2"),
Some(12345)
);
assert_eq!(parse_item_id("https://hn.algolia.com/items/999"), Some(999));
assert_eq!(parse_item_id("https://example.com/foo"), None);
}
}
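The recursive count in `count_descendants` only counts nodes whose type is `comment`, so non-comment children (poll options, for instance) never inflate the total. A minimal standalone sketch of the same recursion over a hand-built tree — no serde here; `Node` and the sample data are made up for illustration:

```rust
// Hand-built stand-in for AlgoliaItem: just a type tag and children.
struct Node {
    kind: &'static str,
    children: Vec<Node>,
}

// Same shape as count_descendants above: count each comment child,
// plus everything below it; skip non-comment nodes entirely.
fn count_comments(n: &Node) -> usize {
    n.children
        .iter()
        .filter(|c| c.kind == "comment")
        .map(|c| 1 + count_comments(c))
        .sum()
}

fn main() {
    // Story with two top-level comments (one has a reply) and one
    // non-comment child that must not be counted.
    let story = Node {
        kind: "story",
        children: vec![
            Node {
                kind: "comment",
                children: vec![Node { kind: "comment", children: vec![] }],
            },
            Node { kind: "comment", children: vec![] },
            Node { kind: "pollopt", children: vec![] },
        ],
    };
    assert_eq!(count_comments(&story), 3);
    println!("ok");
}
```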


@@ -0,0 +1,189 @@
//! HuggingFace dataset structured extractor.
//!
//! Same shape as the model extractor, but hits the dataset endpoint at
//! `huggingface.co/api/datasets/{owner}/{name}`.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "huggingface_dataset",
label: "HuggingFace dataset",
description: "Returns dataset metadata: downloads, likes, license, language, task categories, file list.",
url_patterns: &["https://huggingface.co/datasets/{owner}/{name}"],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "huggingface.co" && host != "www.huggingface.co" {
return false;
}
let path = url
.split("://")
.nth(1)
.and_then(|s| s.split_once('/'))
.map(|(_, p)| p)
.unwrap_or("");
let stripped = path
.split(['?', '#'])
.next()
.unwrap_or("")
.trim_end_matches('/');
let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
// /datasets/{name} (legacy top-level) or /datasets/{owner}/{name} (canonical).
segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3)
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let dataset_path = parse_dataset_path(url).ok_or_else(|| {
FetchError::Build(format!(
"hf_dataset: cannot parse dataset path from '{url}'"
))
})?;
let api_url = format!("https://huggingface.co/api/datasets/{dataset_path}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"hf_dataset: '{dataset_path}' not found"
)));
}
if resp.status == 401 {
return Err(FetchError::Build(format!(
"hf_dataset: '{dataset_path}' requires authentication (gated)"
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"hf_dataset api returned status {}",
resp.status
)));
}
let d: DatasetInfo = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("hf_dataset parse: {e}")))?;
let files: Vec<Value> = d
.siblings
.iter()
.map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
.collect();
Ok(json!({
"url": url,
"id": d.id,
"private": d.private,
"gated": d.gated,
"downloads": d.downloads,
"downloads_30d": d.downloads_all_time,
"likes": d.likes,
"tags": d.tags,
"license": d.card_data.as_ref().and_then(|c| c.license.clone()),
"language": d.card_data.as_ref().and_then(|c| c.language.clone()),
"task_categories": d.card_data.as_ref().and_then(|c| c.task_categories.clone()),
"size_categories": d.card_data.as_ref().and_then(|c| c.size_categories.clone()),
"annotations_creators": d.card_data.as_ref().and_then(|c| c.annotations_creators.clone()),
"configs": d.card_data.as_ref().and_then(|c| c.configs.clone()),
"created_at": d.created_at,
"last_modified": d.last_modified,
"sha": d.sha,
"file_count": d.siblings.len(),
"files": files,
}))
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Returns the part to append to the API URL — either `name` (legacy
/// top-level dataset like `squad`) or `owner/name` (canonical form).
fn parse_dataset_path(url: &str) -> Option<String> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
if segs.next() != Some("datasets") {
return None;
}
let first = segs.next()?.to_string();
match segs.next() {
Some(second) => Some(format!("{first}/{second}")),
None => Some(first),
}
}
#[derive(Deserialize)]
struct DatasetInfo {
id: Option<String>,
private: Option<bool>,
gated: Option<serde_json::Value>,
downloads: Option<i64>,
#[serde(rename = "downloadsAllTime")]
downloads_all_time: Option<i64>,
likes: Option<i64>,
#[serde(default)]
tags: Vec<String>,
#[serde(rename = "createdAt")]
created_at: Option<String>,
#[serde(rename = "lastModified")]
last_modified: Option<String>,
sha: Option<String>,
#[serde(rename = "cardData")]
card_data: Option<DatasetCard>,
#[serde(default)]
siblings: Vec<Sibling>,
}
#[derive(Deserialize)]
struct DatasetCard {
license: Option<serde_json::Value>,
language: Option<serde_json::Value>,
task_categories: Option<serde_json::Value>,
size_categories: Option<serde_json::Value>,
annotations_creators: Option<serde_json::Value>,
configs: Option<serde_json::Value>,
}
#[derive(Deserialize)]
struct Sibling {
rfilename: String,
size: Option<i64>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_dataset_pages() {
assert!(matches("https://huggingface.co/datasets/squad")); // legacy top-level
assert!(matches("https://huggingface.co/datasets/openai/gsm8k")); // canonical owner/name
assert!(!matches("https://huggingface.co/openai/whisper-large-v3"));
assert!(!matches("https://huggingface.co/datasets/"));
}
#[test]
fn parse_dataset_path_works() {
assert_eq!(
parse_dataset_path("https://huggingface.co/datasets/squad"),
Some("squad".into())
);
assert_eq!(
parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k"),
Some("openai/gsm8k".into())
);
assert_eq!(
parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers"),
Some("openai/gsm8k".into())
);
}
}
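`parse_dataset_path` is what lets both the legacy top-level form (`squad`) and the canonical `owner/name` form hit the right API path. Restated as a standalone sketch so both forms can be checked directly:

```rust
// Restated from the extractor above: require a leading "datasets"
// segment, then return either `name` or `owner/name`.
fn parse_dataset_path(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    if segs.next() != Some("datasets") {
        return None;
    }
    let first = segs.next()?.to_string();
    match segs.next() {
        Some(second) => Some(format!("{first}/{second}")),
        None => Some(first),
    }
}

fn main() {
    // Legacy top-level dataset.
    assert_eq!(
        parse_dataset_path("https://huggingface.co/datasets/squad"),
        Some("squad".into())
    );
    // Canonical owner/name, with a query string to drop.
    assert_eq!(
        parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k?lib=x"),
        Some("openai/gsm8k".into())
    );
    // Model page: no "datasets" prefix, so no path.
    assert_eq!(
        parse_dataset_path("https://huggingface.co/openai/whisper-large-v3"),
        None
    );
    println!("ok");
}
```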


@@ -0,0 +1,223 @@
//! HuggingFace model card structured extractor.
//!
//! Uses the public model API at `huggingface.co/api/models/{owner}/{name}`.
//! Returns metadata + the parsed model card front matter, but does not
//! pull the full README body — READMEs are sometimes 100 KB+, and the
//! caller can hit /v1/scrape to get one as markdown.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "huggingface_model",
label: "HuggingFace model",
description: "Returns model metadata: downloads, likes, license, pipeline tag, library name, file list.",
url_patterns: &["https://huggingface.co/{owner}/{name}"],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "huggingface.co" && host != "www.huggingface.co" {
return false;
}
let path = url
.split("://")
.nth(1)
.and_then(|s| s.split_once('/'))
.map(|(_, p)| p)
.unwrap_or("");
let stripped = path
.split(['?', '#'])
.next()
.unwrap_or("")
.trim_end_matches('/');
let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
// /{owner}/{name} but reject HF-internal sections + sub-pages.
if segs.len() != 2 {
return false;
}
!RESERVED_NAMESPACES.contains(&segs[0])
}
const RESERVED_NAMESPACES: &[&str] = &[
"datasets",
"spaces",
"blog",
"docs",
"api",
"models",
"papers",
"pricing",
"tasks",
"join",
"login",
"settings",
"organizations",
"new",
"search",
];
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (owner, name) = parse_owner_name(url).ok_or_else(|| {
FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'"))
})?;
let api_url = format!("https://huggingface.co/api/models/{owner}/{name}");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"hf model: '{owner}/{name}' not found"
)));
}
if resp.status == 401 {
return Err(FetchError::Build(format!(
"hf model: '{owner}/{name}' requires authentication (gated repo)"
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"hf api returned status {}",
resp.status
)));
}
let m: ModelInfo = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("hf api parse: {e}")))?;
// Surface a flat file list — full siblings can be hundreds of entries
// for big repos. We keep it as-is because callers want to know about
// every shard; if it bloats responses too much we'll add pagination.
let files: Vec<Value> = m
.siblings
.iter()
.map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
.collect();
Ok(json!({
"url": url,
"id": m.id,
"model_id": m.model_id,
"private": m.private,
"gated": m.gated,
"downloads": m.downloads,
"downloads_30d": m.downloads_all_time,
"likes": m.likes,
"library_name": m.library_name,
"pipeline_tag": m.pipeline_tag,
"tags": m.tags,
"license": m.card_data.as_ref().and_then(|c| c.license.clone()),
"language": m.card_data.as_ref().and_then(|c| c.language.clone()),
"datasets": m.card_data.as_ref().and_then(|c| c.datasets.clone()),
"base_model": m.card_data.as_ref().and_then(|c| c.base_model.clone()),
"model_type": m.card_data.as_ref().and_then(|c| c.model_type.clone()),
"created_at": m.created_at,
"last_modified": m.last_modified,
"sha": m.sha,
"file_count": m.siblings.len(),
"files": files,
}))
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn parse_owner_name(url: &str) -> Option<(String, String)> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
let owner = segs.next()?.to_string();
let name = segs.next()?.to_string();
Some((owner, name))
}
// ---------------------------------------------------------------------------
// HF API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct ModelInfo {
id: Option<String>,
#[serde(rename = "modelId")]
model_id: Option<String>,
private: Option<bool>,
gated: Option<serde_json::Value>, // bool or string ("auto" / "manual" / false)
downloads: Option<i64>,
#[serde(rename = "downloadsAllTime")]
downloads_all_time: Option<i64>,
likes: Option<i64>,
#[serde(rename = "library_name")]
library_name: Option<String>,
#[serde(rename = "pipeline_tag")]
pipeline_tag: Option<String>,
#[serde(default)]
tags: Vec<String>,
#[serde(rename = "createdAt")]
created_at: Option<String>,
#[serde(rename = "lastModified")]
last_modified: Option<String>,
sha: Option<String>,
#[serde(rename = "cardData")]
card_data: Option<CardData>,
#[serde(default)]
siblings: Vec<Sibling>,
}
#[derive(Deserialize)]
struct CardData {
license: Option<serde_json::Value>, // string or array
language: Option<serde_json::Value>,
datasets: Option<serde_json::Value>,
#[serde(rename = "base_model")]
base_model: Option<serde_json::Value>,
#[serde(rename = "model_type")]
model_type: Option<String>,
}
#[derive(Deserialize)]
struct Sibling {
rfilename: String,
size: Option<i64>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_model_pages() {
assert!(matches("https://huggingface.co/meta-llama/Meta-Llama-3-8B"));
assert!(matches("https://huggingface.co/openai/whisper-large-v3"));
assert!(matches("https://huggingface.co/bert-base-uncased/main")); // owner=bert-base-uncased name=main: false positive but acceptable for v1
}
#[test]
fn rejects_hf_section_pages() {
assert!(!matches("https://huggingface.co/datasets/squad"));
assert!(!matches("https://huggingface.co/spaces/foo/bar"));
assert!(!matches("https://huggingface.co/blog/intro"));
assert!(!matches("https://huggingface.co/"));
assert!(!matches("https://huggingface.co/meta-llama"));
}
#[test]
fn parse_owner_name_pulls_both() {
assert_eq!(
parse_owner_name("https://huggingface.co/meta-llama/Meta-Llama-3-8B"),
Some(("meta-llama".into(), "Meta-Llama-3-8B".into()))
);
assert_eq!(
parse_owner_name("https://huggingface.co/openai/whisper-large-v3?library=transformers"),
Some(("openai".into(), "whisper-large-v3".into()))
);
}
}
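The reserved-namespace check is what keeps the `/{owner}/{name}` pattern from swallowing HF's own section pages. A trimmed standalone sketch — the host check is omitted and the list shortened to a few entries for brevity:

```rust
// Shortened stand-in for RESERVED_NAMESPACES above (illustrative subset).
const RESERVED: &[&str] = &["datasets", "spaces", "blog", "docs", "api"];

// Sketch of the path half of matches(): exactly two segments, and the
// first segment must not be an HF-internal section.
fn looks_like_model(url: &str) -> bool {
    let path = match url.split("://").nth(1).and_then(|s| s.split_once('/')) {
        Some((_, p)) => p,
        None => return false,
    };
    let stripped = path.split(['?', '#']).next().unwrap_or("").trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 2 && !RESERVED.contains(&segs[0])
}

fn main() {
    assert!(looks_like_model("https://huggingface.co/openai/whisper-large-v3"));
    // Section pages and bare owner pages are rejected.
    assert!(!looks_like_model("https://huggingface.co/datasets/squad"));
    assert!(!looks_like_model("https://huggingface.co/meta-llama"));
    println!("ok");
}
```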


@@ -0,0 +1,235 @@
//! Instagram post structured extractor.
//!
//! Uses Instagram's public embed endpoint
//! `/p/{shortcode}/embed/captioned/` which returns SSR HTML with the
//! full caption, author username, and thumbnail. No auth required.
//! The same endpoint serves reels and IGTV under `/reel/{code}` and
//! `/tv/{code}` URLs (we accept all three).
use regex::Regex;
use serde_json::{Value, json};
use std::sync::OnceLock;
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "instagram_post",
label: "Instagram post",
description: "Returns full caption, author username, thumbnail, and post type (post / reel / tv) via Instagram's public embed.",
url_patterns: &[
"https://www.instagram.com/p/{shortcode}/",
"https://www.instagram.com/reel/{shortcode}/",
"https://www.instagram.com/tv/{shortcode}/",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !matches!(host, "www.instagram.com" | "instagram.com") {
return false;
}
parse_shortcode(url).is_some()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| {
FetchError::Build(format!(
"instagram_post: cannot parse shortcode from '{url}'"
))
})?;
// Instagram serves the same embed HTML for posts/reels/tv under /p/.
let embed_url = format!("https://www.instagram.com/p/{shortcode}/embed/captioned/");
let resp = client.fetch(&embed_url).await?;
if resp.status != 200 {
return Err(FetchError::Build(format!(
"instagram embed returned status {} for {shortcode}",
resp.status
)));
}
let html = &resp.html;
let username = parse_username(html);
let caption = parse_caption(html);
let thumbnail = parse_thumbnail(html);
Ok(json!({
"url": url,
"embed_url": embed_url,
"shortcode": shortcode,
"kind": kind,
"data_completeness": "embed",
"author_username": username,
"caption": caption,
"thumbnail_url": thumbnail,
"canonical_url": format!("https://www.instagram.com/{}/{shortcode}/", path_segment_for(kind)),
}))
}
// ---------------------------------------------------------------------------
// URL parsing
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Returns `(kind, shortcode)` where kind ∈ {`post`, `reel`, `tv`}.
fn parse_shortcode(url: &str) -> Option<(&'static str, String)> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
let first = segs.next()?;
let kind = match first {
"p" => "post",
"reel" | "reels" => "reel",
"tv" => "tv",
_ => return None,
};
let shortcode = segs.next()?;
if shortcode.is_empty() {
return None;
}
Some((kind, shortcode.to_string()))
}
fn path_segment_for(kind: &str) -> &'static str {
match kind {
"reel" => "reel",
"tv" => "tv",
_ => "p",
}
}
// ---------------------------------------------------------------------------
// HTML scraping
// ---------------------------------------------------------------------------
/// Username appears as the anchor text inside `<a class="CaptionUsername">`.
fn parse_username(html: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r#"(?s)class="CaptionUsername"[^>]*>([^<]+)<"#).unwrap());
re.captures(html)
.and_then(|c| c.get(1))
.map(|m| html_decode(m.as_str().trim()))
}
/// Caption sits inside `<div class="Caption">` after the username anchor.
/// We grab the whole Caption block and strip out the username link, time
/// node, and any trailing "Photo by" / "View ... on Instagram" boilerplate.
fn parse_caption(html: &str) -> Option<String> {
static RE_OUTER: OnceLock<Regex> = OnceLock::new();
let outer = RE_OUTER
.get_or_init(|| Regex::new(r#"(?s)<div\s+class="Caption"[^>]*>(.*?)</div>"#).unwrap());
let block = outer.captures(html)?.get(1)?.as_str();
// Strip everything wrapped in <a class="CaptionUsername">...</a>.
static RE_USER: OnceLock<Regex> = OnceLock::new();
let user_re = RE_USER
.get_or_init(|| Regex::new(r#"(?s)<a[^>]*class="CaptionUsername"[^>]*>.*?</a>"#).unwrap());
let stripped = user_re.replace_all(block, "");
// Then strip any remaining HTML tags.
static RE_TAGS: OnceLock<Regex> = OnceLock::new();
let tag_re = RE_TAGS.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
let text = tag_re.replace_all(&stripped, " ");
let cleaned = collapse_whitespace(&html_decode(text.trim()));
if cleaned.is_empty() {
None
} else {
Some(cleaned)
}
}
/// Thumbnail is the `<img class="EmbeddedMediaImage">` inside the embed
/// (or the og:image as fallback).
fn parse_thumbnail(html: &str) -> Option<String> {
static RE_IMG: OnceLock<Regex> = OnceLock::new();
let img_re = RE_IMG.get_or_init(|| {
Regex::new(r#"(?s)<img[^>]+class="[^"]*EmbeddedMediaImage[^"]*"[^>]+src="([^"]+)""#)
.unwrap()
});
if let Some(m) = img_re.captures(html).and_then(|c| c.get(1)) {
return Some(html_decode(m.as_str()));
}
static RE_OG: OnceLock<Regex> = OnceLock::new();
let og_re = RE_OG.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:image"[^>]+content="([^"]+)""#).unwrap()
});
og_re
.captures(html)
.and_then(|c| c.get(1))
.map(|m| html_decode(m.as_str()))
}
fn html_decode(s: &str) -> String {
s.replace("&lt;", "<")
.replace("&gt;", ">")
.replace("&quot;", "\"")
.replace("&#39;", "'")
.replace("&#064;", "@")
.replace("&#x2022;", "")
.replace("&hellip;", "")
// `&amp;` must be decoded last, otherwise already-escaped text such as
// `&amp;lt;` would be double-decoded into `<`.
.replace("&amp;", "&")
}
fn collapse_whitespace(s: &str) -> String {
s.split_whitespace().collect::<Vec<_>>().join(" ")
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_post_reel_tv_urls() {
assert!(matches("https://www.instagram.com/p/DT-RICMjeK5/"));
assert!(matches(
"https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"
));
assert!(matches("https://www.instagram.com/reel/abc123/"));
assert!(matches("https://www.instagram.com/tv/abc123/"));
assert!(!matches("https://www.instagram.com/ticketswave"));
assert!(!matches("https://www.instagram.com/"));
assert!(!matches("https://example.com/p/abc/"));
}
#[test]
fn parse_shortcode_reads_each_kind() {
assert_eq!(
parse_shortcode("https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"),
Some(("post", "DT-RICMjeK5".into()))
);
assert_eq!(
parse_shortcode("https://www.instagram.com/reel/abc123/"),
Some(("reel", "abc123".into()))
);
assert_eq!(
parse_shortcode("https://www.instagram.com/tv/abc123"),
Some(("tv", "abc123".into()))
);
}
#[test]
fn parse_username_pulls_anchor_text() {
let html = r#"<a class="CaptionUsername" href="...">ticketswave</a>"#;
assert_eq!(parse_username(html).as_deref(), Some("ticketswave"));
}
#[test]
fn parse_caption_strips_username_anchor() {
let html = r#"<div class="Caption"><a class="CaptionUsername" href="...">ticketswave</a> Some caption text here</div>"#;
assert_eq!(
parse_caption(html).as_deref(),
Some("Some caption text here")
);
}
}
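One subtlety worth calling out in entity decoding like `html_decode` above: the order of the `replace` calls matters. If `&amp;` is decoded before the other entities, already-escaped text such as `&amp;lt;` is double-decoded into `<`. A minimal sketch with `&amp;` decoded last (only a subset of the entities above, for brevity):

```rust
// Minimal entity decoder: every named entity first, ampersand last.
fn html_decode(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        // Decoding this first would re-expose "&lt;" to a second pass.
        .replace("&amp;", "&")
}

fn main() {
    // The author typed "&lt;3" literally; the page escapes it as "&amp;lt;3".
    // Correct decoding yields the literal text back, not "<3"-as-markup.
    assert_eq!(html_decode("&amp;lt;3"), "&lt;3");
    assert_eq!(html_decode("a &amp; b"), "a & b");
    println!("ok");
}
```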


@@ -0,0 +1,465 @@
//! Instagram profile structured extractor.
//!
//! Hits Instagram's internal `web_profile_info` endpoint at
//! `instagram.com/api/v1/users/web_profile_info/?username=X`. The
//! `x-ig-app-id` header is Instagram's own public web-app id (not a
//! secret) — the same value Instagram's own JavaScript bundle sends.
//!
//! Returns the full profile (bio, exact follower count, verified /
//! business flags, profile picture) plus the **12 most recent posts**
//! with shortcodes, like counts, types, thumbnails, and caption
//! previews. Callers can fan out to `/v1/scrape/instagram_post` per
//! shortcode to get the full caption + media.
//!
//! Pagination beyond 12 requires authenticated cookies + a CSRF token;
//! we accept that as the practical ceiling for the unauth path. The
//! cloud (with stored sessions) can paginate later as a follow-up.
//!
//! Falls back to OG-tag scraping of the public profile page if the API
//! returns 401/403 — Instagram has tightened this endpoint multiple
//! times, so we keep the second path warm.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "instagram_profile",
label: "Instagram profile",
description: "Returns full profile metadata + the 12 most recent posts (shortcode, url, type, likes, thumbnail).",
url_patterns: &["https://www.instagram.com/{username}/"],
};
/// Instagram's own public web-app identifier. Sent by their JS bundle
/// on every API call, accepted by the unauth endpoint, not a secret.
const IG_APP_ID: &str = "936619743392459";
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !matches!(host, "www.instagram.com" | "instagram.com") {
return false;
}
let path = url
.split("://")
.nth(1)
.and_then(|s| s.split_once('/'))
.map(|(_, p)| p)
.unwrap_or("");
let stripped = path
.split(['?', '#'])
.next()
.unwrap_or("")
.trim_end_matches('/');
let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
segs.len() == 1 && !RESERVED.contains(&segs[0])
}
const RESERVED: &[&str] = &[
"p",
"reel",
"reels",
"tv",
"explore",
"stories",
"directory",
"accounts",
"about",
"developer",
"press",
"api",
"ads",
"blog",
"fragments",
"terms",
"privacy",
"session",
"login",
"signup",
];
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let username = parse_username(url).ok_or_else(|| {
FetchError::Build(format!(
"instagram_profile: cannot parse username from '{url}'"
))
})?;
let api_url =
format!("https://www.instagram.com/api/v1/users/web_profile_info/?username={username}");
let extra_headers: &[(&str, &str)] = &[
("x-ig-app-id", IG_APP_ID),
("accept", "*/*"),
("sec-fetch-site", "same-origin"),
("x-requested-with", "XMLHttpRequest"),
];
let resp = client.fetch_with_headers(&api_url, extra_headers).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"instagram_profile: '{username}' not found"
)));
}
// Auth wall fallback: Instagram occasionally tightens this endpoint
// and starts returning 401/403/302 to a login page. When that
// happens we still want to give the caller something useful — the
// OG tags from the public HTML page (no posts list, but bio etc).
if !(200..300).contains(&resp.status) {
return og_fallback(client, &username, url, resp.status).await;
}
let body: ApiResponse = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("instagram_profile parse: {e}")))?;
let user = body.data.user;
let recent_posts: Vec<Value> = user
.edge_owner_to_timeline_media
.as_ref()
.map(|m| m.edges.iter().map(|e| post_summary(&e.node)).collect())
.unwrap_or_default();
Ok(json!({
"url": url,
"canonical_url": format!("https://www.instagram.com/{username}/"),
"username": user.username.unwrap_or(username),
"data_completeness": "api",
"user_id": user.id,
"full_name": user.full_name,
"biography": user.biography,
"biography_links": user.bio_links,
"external_url": user.external_url,
"category": user.category_name,
"follower_count": user.edge_followed_by.map(|c| c.count),
"following_count": user.edge_follow.map(|c| c.count),
"post_count": user.edge_owner_to_timeline_media.as_ref().map(|m| m.count),
"is_verified": user.is_verified,
"is_private": user.is_private,
"is_business": user.is_business_account,
"is_professional": user.is_professional_account,
"profile_pic_url": user.profile_pic_url_hd.or(user.profile_pic_url),
"recent_posts": recent_posts,
}))
}
/// Build the per-post summary the caller fans out from. Includes a
/// constructed `url` so the loop is `for p in recent_posts: scrape('instagram_post', p.url)`.
fn post_summary(n: &MediaNode) -> Value {
let kind = classify(n);
let url = match kind {
"reel" => format!(
"https://www.instagram.com/reel/{}/",
n.shortcode.as_deref().unwrap_or("")
),
_ => format!(
"https://www.instagram.com/p/{}/",
n.shortcode.as_deref().unwrap_or("")
),
};
let caption = n
.edge_media_to_caption
.as_ref()
.and_then(|c| c.edges.first())
.and_then(|e| e.node.text.clone());
json!({
"shortcode": n.shortcode,
"url": url,
"kind": kind,
"is_video": n.is_video.unwrap_or(false),
"video_views": n.video_view_count,
"thumbnail_url": n.thumbnail_src.clone().or_else(|| n.display_url.clone()),
"display_url": n.display_url,
"like_count": n.edge_media_preview_like.as_ref().map(|c| c.count),
"comment_count": n.edge_media_to_comment.as_ref().map(|c| c.count),
"taken_at": n.taken_at_timestamp,
"caption": caption,
"alt_text": n.accessibility_caption,
"dimensions": n.dimensions.as_ref().map(|d| json!({"width": d.width, "height": d.height})),
"product_type": n.product_type,
})
}
/// Best-effort post-type classification. `clips` is reels; `feed` is
/// the regular grid. Sidecar = multi-photo carousel.
fn classify(n: &MediaNode) -> &'static str {
if n.product_type.as_deref() == Some("clips") {
return "reel";
}
match n.typename.as_deref() {
Some("GraphSidecar") => "carousel",
Some("GraphVideo") => "video",
Some("GraphImage") => "photo",
_ => "post",
}
}
/// Fallback when the API path is blocked: hit the public profile HTML,
/// pull whatever OG tags we can. Returns less data and explicitly
/// flags `data_completeness: "og_only"` so callers know.
async fn og_fallback(
client: &dyn Fetcher,
username: &str,
original_url: &str,
api_status: u16,
) -> Result<Value, FetchError> {
let canonical = format!("https://www.instagram.com/{username}/");
let resp = client.fetch(&canonical).await?;
if resp.status != 200 {
return Err(FetchError::Build(format!(
"instagram_profile: api status {api_status}, html status {} for {username}",
resp.status
)));
}
let og = parse_og_tags(&resp.html);
let (followers, following, posts) =
parse_counts_from_og_description(og.get("description").map(String::as_str));
Ok(json!({
"url": original_url,
"canonical_url": canonical,
"username": username,
"data_completeness": "og_only",
"fallback_reason": format!("api returned {api_status}"),
"full_name": parse_full_name(&og.get("title").cloned().unwrap_or_default()),
"follower_count": followers,
"following_count": following,
"post_count": posts,
"profile_pic_url": og.get("image").cloned(),
"biography": null_value(),
"is_verified": null_value(),
"is_business": null_value(),
"recent_posts": Vec::<Value>::new(),
}))
}
fn null_value() -> Value {
Value::Null
}
// ---------------------------------------------------------------------------
// URL parsing
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn parse_username(url: &str) -> Option<String> {
let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
stripped
.split('/')
.find(|s| !s.is_empty())
.map(|s| s.to_string())
}
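Exercised standalone, the helper's edge behavior is easy to check: query string and fragment are cut before the path walk, and a URL with no path at all yields `None` rather than the host. A self-contained copy:

```rust
// Username = first non-empty path segment, with query string,
// fragment, and trailing slashes stripped first.
fn parse_username(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    stripped
        .split('/')
        .find(|s| !s.is_empty())
        .map(|s| s.to_string())
}

fn main() {
    assert_eq!(
        parse_username("https://instagram.com/0xmassi/?hl=en").as_deref(),
        Some("0xmassi")
    );
    // No path at all -> no username.
    assert_eq!(parse_username("https://www.instagram.com"), None);
}
```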
// ---------------------------------------------------------------------------
// OG-fallback helpers. Kept self-contained: same shape as the previously
// shipped version, retained as the safety net.
// ---------------------------------------------------------------------------
fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
use regex::Regex;
use std::sync::OnceLock;
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
let mut out = std::collections::HashMap::new();
for c in re.captures_iter(html) {
let k = c
.get(1)
.map(|m| m.as_str().to_lowercase())
.unwrap_or_default();
let v = c
.get(2)
.map(|m| html_decode(m.as_str()))
.unwrap_or_default();
out.entry(k).or_insert(v);
}
out
}
fn parse_full_name(og_title: &str) -> Option<String> {
if og_title.is_empty() {
return None;
}
let decoded = html_decode(og_title);
let trimmed = decoded.split('(').next().unwrap_or(&decoded).trim();
if trimmed.is_empty() {
None
} else {
Some(trimmed.to_string())
}
}
fn parse_counts_from_og_description(desc: Option<&str>) -> (Option<i64>, Option<i64>, Option<i64>) {
let Some(text) = desc else {
return (None, None, None);
};
let decoded = html_decode(text);
use regex::Regex;
use std::sync::OnceLock;
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r"(?i)([\d.,]+[KMB]?)\s*Followers,\s*([\d.,]+[KMB]?)\s*Following,\s*([\d.,]+[KMB]?)\s*Posts").unwrap()
});
if let Some(c) = re.captures(&decoded) {
return (
c.get(1).and_then(|m| parse_compact_number(m.as_str())),
c.get(2).and_then(|m| parse_compact_number(m.as_str())),
c.get(3).and_then(|m| parse_compact_number(m.as_str())),
);
}
(None, None, None)
}
fn parse_compact_number(s: &str) -> Option<i64> {
let s = s.trim();
let (num_str, mul) = match s.chars().last() {
Some('K') => (&s[..s.len() - 1], 1_000i64),
Some('M') => (&s[..s.len() - 1], 1_000_000i64),
Some('B') => (&s[..s.len() - 1], 1_000_000_000i64),
_ => (s, 1i64),
};
let cleaned: String = num_str.chars().filter(|c| *c != ',').collect();
cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64)
}
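A standalone copy of the compact-number parser, exercising the edge cases; note the suffix match is uppercase-only, which mirrors the `[KMB]` class in the regex that feeds it:

```rust
// K/M/B suffix expansion with comma stripping; fractional values
// ("1.5M") truncate after multiplication.
fn parse_compact_number(s: &str) -> Option<i64> {
    let s = s.trim();
    let (num_str, mul) = match s.chars().last() {
        Some('K') => (&s[..s.len() - 1], 1_000i64),
        Some('M') => (&s[..s.len() - 1], 1_000_000i64),
        Some('B') => (&s[..s.len() - 1], 1_000_000_000i64),
        _ => (s, 1i64),
    };
    let cleaned: String = num_str.chars().filter(|c| *c != ',').collect();
    cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64)
}

fn main() {
    assert_eq!(parse_compact_number("1.5M"), Some(1_500_000));
    assert_eq!(parse_compact_number("1,234"), Some(1_234));
    // Lowercase suffixes fall through to the plain-number branch and
    // fail the f64 parse, yielding None.
    assert_eq!(parse_compact_number("18k"), None);
}
```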
fn html_decode(s: &str) -> String {
s.replace("&amp;", "&")
.replace("&lt;", "<")
.replace("&gt;", ">")
.replace("&quot;", "\"")
.replace("&#39;", "'")
.replace("&#064;", "@")
.replace("&#x2022;", "")
.replace("&hellip;", "")
}
// ---------------------------------------------------------------------------
// Instagram web_profile_info API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct ApiResponse {
data: ApiData,
}
#[derive(Deserialize)]
struct ApiData {
user: User,
}
#[derive(Deserialize)]
struct User {
id: Option<String>,
username: Option<String>,
full_name: Option<String>,
biography: Option<String>,
bio_links: Option<Vec<serde_json::Value>>,
external_url: Option<String>,
category_name: Option<String>,
profile_pic_url: Option<String>,
profile_pic_url_hd: Option<String>,
is_verified: Option<bool>,
is_private: Option<bool>,
is_business_account: Option<bool>,
is_professional_account: Option<bool>,
edge_followed_by: Option<EdgeCount>,
edge_follow: Option<EdgeCount>,
edge_owner_to_timeline_media: Option<MediaEdges>,
}
#[derive(Deserialize)]
struct EdgeCount {
count: i64,
}
#[derive(Deserialize)]
struct MediaEdges {
count: i64,
edges: Vec<MediaEdge>,
}
#[derive(Deserialize)]
struct MediaEdge {
node: MediaNode,
}
#[derive(Deserialize)]
struct MediaNode {
#[serde(rename = "__typename")]
typename: Option<String>,
shortcode: Option<String>,
is_video: Option<bool>,
video_view_count: Option<i64>,
display_url: Option<String>,
thumbnail_src: Option<String>,
accessibility_caption: Option<String>,
taken_at_timestamp: Option<i64>,
product_type: Option<String>,
dimensions: Option<Dimensions>,
edge_media_preview_like: Option<EdgeCount>,
edge_media_to_comment: Option<EdgeCount>,
edge_media_to_caption: Option<CaptionEdges>,
}
#[derive(Deserialize)]
struct Dimensions {
width: i64,
height: i64,
}
#[derive(Deserialize)]
struct CaptionEdges {
edges: Vec<CaptionEdge>,
}
#[derive(Deserialize)]
struct CaptionEdge {
node: CaptionNode,
}
#[derive(Deserialize)]
struct CaptionNode {
text: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_profile_urls() {
assert!(matches("https://www.instagram.com/ticketswave"));
assert!(matches("https://www.instagram.com/ticketswave/"));
assert!(matches("https://instagram.com/0xmassi/?hl=en"));
assert!(!matches("https://www.instagram.com/p/DT-RICMjeK5/"));
assert!(!matches("https://www.instagram.com/explore"));
assert!(!matches("https://www.instagram.com/"));
assert!(!matches("https://example.com/foo"));
}
#[test]
fn parse_full_name_strips_handle() {
assert_eq!(
parse_full_name("Ticket Wave (&#064;ticketswave) &#x2022; Instagram photos and videos"),
Some("Ticket Wave".into())
);
}
#[test]
fn compact_number_handles_kmb() {
assert_eq!(parse_compact_number("18K"), Some(18_000));
assert_eq!(parse_compact_number("1.5M"), Some(1_500_000));
assert_eq!(parse_compact_number("1,234"), Some(1_234));
assert_eq!(parse_compact_number("641"), Some(641));
}
}


@@ -0,0 +1,266 @@
//! LinkedIn post structured extractor.
//!
//! Uses the public embed endpoint `/embed/feed/update/{urn}` which
//! LinkedIn provides for sites that want to render a post inline. No
//! auth required, returns SSR HTML with the full post body, OG tags,
//! image, and a link back to the original post.
//!
//! Accepts both URN forms (`urn:li:share:N` and `urn:li:activity:N`)
//! and pretty post URLs (`/posts/{user}_{slug}-{id}-{suffix}`) by
//! pulling the trailing numeric id and converting to an activity URN.
use regex::Regex;
use serde_json::{Value, json};
use std::sync::OnceLock;
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "linkedin_post",
label: "LinkedIn post",
description: "Returns post body, author name, image, and original URL via LinkedIn's public embed endpoint.",
url_patterns: &[
"https://www.linkedin.com/feed/update/urn:li:share:{id}",
"https://www.linkedin.com/feed/update/urn:li:activity:{id}",
"https://www.linkedin.com/posts/{user}_{slug}-{id}-{suffix}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !matches!(host, "www.linkedin.com" | "linkedin.com") {
return false;
}
url.contains("/feed/update/urn:li:") || url.contains("/posts/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let urn = extract_urn(url).ok_or_else(|| {
FetchError::Build(format!(
"linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})"
))
})?;
let embed_url = format!("https://www.linkedin.com/embed/feed/update/{urn}");
let resp = client.fetch(&embed_url).await?;
if resp.status != 200 {
return Err(FetchError::Build(format!(
"linkedin embed returned status {} for {urn}",
resp.status
)));
}
let html = &resp.html;
let og = parse_og_tags(html);
let body = parse_post_body(html);
let author = parse_author(html);
let canonical_url = og.get("url").cloned().unwrap_or_else(|| embed_url.clone());
Ok(json!({
"url": url,
"embed_url": embed_url,
"urn": urn,
"canonical_url": canonical_url,
"data_completeness": "embed",
"title": og.get("title").cloned(),
"body": body,
"author_name": author,
"image_url": og.get("image").cloned(),
"site_name": og.get("site_name").cloned().unwrap_or_else(|| "LinkedIn".into()),
}))
}
// ---------------------------------------------------------------------------
// URN extraction
// ---------------------------------------------------------------------------
/// Pull a `urn:li:share:N` or `urn:li:activity:N` from any LinkedIn URL.
/// `/posts/{slug}-{id}-{suffix}` URLs encode the activity id as the second-
/// to-last `-` separated chunk. Both forms map to a URN we can hit the
/// embed endpoint with.
fn extract_urn(url: &str) -> Option<String> {
if let Some(idx) = url.find("urn:li:") {
let tail = &url[idx..];
let end = tail.find(['/', '?', '#']).unwrap_or(tail.len());
let urn = &tail[..end];
// Validate shape: urn:li:{type}:{digits}
let mut parts = urn.split(':');
if parts.next() == Some("urn")
&& parts.next() == Some("li")
&& parts.next().is_some()
&& parts
.next()
.filter(|p| p.chars().all(|c| c.is_ascii_digit()))
.is_some()
{
return Some(urn.to_string());
}
}
// /posts/{user}_{slug}-{long-numeric-id}-{short-hash}/: the numeric
// id is the second-to-last `-`-separated chunk.
if url.contains("/posts/") {
static RE: OnceLock<Regex> = OnceLock::new();
let re =
RE.get_or_init(|| Regex::new(r"/posts/[^/]*?-(\d{15,})-[A-Za-z0-9]{2,}/?").unwrap());
if let Some(c) = re.captures(url)
&& let Some(id) = c.get(1)
{
return Some(format!("urn:li:activity:{}", id.as_str()));
}
}
None
}
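The pretty-URL branch above needs `regex`; a regex-free sketch of the same rule shows its shape, assuming (as the regex does) that ids are runs of at least 15 digits:

```rust
// Regex-free sketch: take the path after /posts/, split on '-', and
// accept the second-to-last chunk when it is a long (15+) digit run.
fn activity_urn_from_posts_url(url: &str) -> Option<String> {
    let after = url.split("/posts/").nth(1)?;
    let slug = after.split(['?', '#']).next()?.trim_end_matches('/');
    let parts: Vec<&str> = slug.split('-').collect();
    if parts.len() < 2 {
        return None;
    }
    let id = parts[parts.len() - 2];
    if id.len() >= 15 && id.chars().all(|c| c.is_ascii_digit()) {
        return Some(format!("urn:li:activity:{id}"));
    }
    None
}

fn main() {
    assert_eq!(
        activity_urn_from_posts_url(
            "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/"
        )
        .as_deref(),
        Some("urn:li:activity:7452618583290892288")
    );
    // No id-looking chunk -> no URN.
    assert_eq!(
        activity_urn_from_posts_url("https://www.linkedin.com/posts/foo"),
        None
    );
}
```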
// ---------------------------------------------------------------------------
// HTML scraping
// ---------------------------------------------------------------------------
/// Pull `og:foo` → value pairs out of `<meta property="og:..." content="...">`.
/// Returns lowercased keys with leading `og:` stripped.
fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
let mut out = std::collections::HashMap::new();
for c in re.captures_iter(html) {
let k = c
.get(1)
.map(|m| m.as_str().to_lowercase())
.unwrap_or_default();
let v = c
.get(2)
.map(|m| html_decode(m.as_str()))
.unwrap_or_default();
out.entry(k).or_insert(v);
}
out
}
/// Extract the post body text from the embed page. LinkedIn renders it
/// inside `<p class="attributed-text-segment-list__content ...">{text}</p>`
/// where the inner content can include nested `<a>` tags for links.
fn parse_post_body(html: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(
r#"(?s)<p[^>]+class="[^"]*attributed-text-segment-list__content[^"]*"[^>]*>(.*?)</p>"#,
)
.unwrap()
});
let inner = re.captures(html).and_then(|c| c.get(1))?.as_str();
Some(strip_tags(inner).trim().to_string())
}
/// Author name lives in the `<title>` like:
/// "55 founding members are in… | Orc Dev"
/// The chunk after the final `|` is the author display name. Returns
/// `None` when there is no `<title>` or the title contains no `|`.
fn parse_author(html: &str) -> Option<String> {
static RE_TITLE: OnceLock<Regex> = OnceLock::new();
let re = RE_TITLE.get_or_init(|| Regex::new(r"<title>([^<]+)</title>").unwrap());
let title = re.captures(html).and_then(|c| c.get(1))?.as_str();
title
.rsplit_once('|')
.map(|(_, name)| html_decode(name.trim()))
}
/// Replace the small set of HTML entities LinkedIn (and Instagram, etc.)
/// stuff into OG content attributes.
fn html_decode(s: &str) -> String {
s.replace("&amp;", "&")
.replace("&lt;", "<")
.replace("&gt;", ">")
.replace("&quot;", "\"")
.replace("&#39;", "'")
.replace("&#064;", "@")
.replace("&#x2022;", "")
.replace("&hellip;", "")
}
/// Crude HTML tag stripper for the post body. Preserves text inside
/// nested anchors so URLs don't disappear, and collapses runs of
/// whitespace introduced by line wrapping.
fn strip_tags(html: &str) -> String {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
let no_tags = re.replace_all(html, "").to_string();
html_decode(&no_tags)
}
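The real `strip_tags` leans on the `regex` crate; the same drop-tags-keep-text behavior can be sketched with a plain character scan (entity decoding omitted here):

```rust
// Regex-free sketch of the tag stripper: drop everything between '<'
// and the next '>', keep all other text, so anchor *text* survives
// even though the <a> tags themselves do not.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

fn main() {
    assert_eq!(
        strip_tags(r#"Hello <a href="x">link</a> world"#),
        "Hello link world"
    );
}
```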
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_li_post_urls() {
assert!(matches(
"https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"
));
assert!(matches(
"https://www.linkedin.com/feed/update/urn:li:activity:7452618583290892288"
));
assert!(matches(
"https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c"
));
assert!(!matches("https://www.linkedin.com/in/foo"));
assert!(!matches("https://www.linkedin.com/"));
assert!(!matches("https://example.com/feed/update/urn:li:share:1"));
}
#[test]
fn extract_urn_from_share_url() {
assert_eq!(
extract_urn("https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"),
Some("urn:li:share:7452618582213144577".into())
);
}
#[test]
fn extract_urn_from_pretty_post_url() {
assert_eq!(
extract_urn(
"https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/"
),
Some("urn:li:activity:7452618583290892288".into())
);
}
#[test]
fn parse_og_tags_basic() {
let html = r#"<meta property="og:image" content="https://x.com/a.png">
<meta property="og:url" content="https://example.com/x">"#;
let og = parse_og_tags(html);
assert_eq!(
og.get("image").map(String::as_str),
Some("https://x.com/a.png")
);
assert_eq!(
og.get("url").map(String::as_str),
Some("https://example.com/x")
);
}
#[test]
fn parse_post_body_strips_anchor_tags() {
let html = r#"<p class="attributed-text-segment-list__content text-color-text" dir="ltr">Hello <a href="x">link</a> world</p>"#;
assert_eq!(parse_post_body(html).as_deref(), Some("Hello link world"));
}
#[test]
fn html_decode_handles_common_entities() {
assert_eq!(html_decode("AT&amp;T &#064;jane"), "AT&T @jane");
}
}


@@ -0,0 +1,502 @@
//! Vertical extractors: site-specific parsers that return typed JSON
//! instead of generic markdown.
//!
//! Each extractor handles a single site or platform and exposes:
//! - `matches(url)` to claim ownership of a URL pattern
//! - `extract(client, url)` to fetch + parse into a typed JSON `Value`
//! - `INFO` static for the catalog (`/v1/extractors`)
//!
//! The dispatch in this module is a simple `match`-style chain rather than
//! a trait registry. With ~30 extractors that's still fast and avoids the
//! ceremony of dynamic dispatch. If we hit 50+ we'll revisit.
//!
//! Extractors prefer official JSON APIs over HTML scraping where one
//! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have
//! one). HTML extraction is the fallback for sites that don't.
pub mod amazon_product;
pub mod arxiv;
pub mod crates_io;
pub mod dev_to;
pub mod docker_hub;
pub mod ebay_listing;
pub mod ecommerce_product;
pub mod etsy_listing;
pub mod github_issue;
pub mod github_pr;
pub mod github_release;
pub mod github_repo;
pub mod hackernews;
pub mod huggingface_dataset;
pub mod huggingface_model;
pub mod instagram_post;
pub mod instagram_profile;
pub mod linkedin_post;
pub mod npm;
pub mod pypi;
pub mod reddit;
pub mod shopify_collection;
pub mod shopify_product;
pub mod stackoverflow;
pub mod substack_post;
pub mod trustpilot_reviews;
pub mod woocommerce_product;
pub mod youtube_video;
use serde::Serialize;
use serde_json::Value;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
/// Public catalog entry for `/v1/extractors`. Stable shape — clients
/// rely on `name` to pick the right `/v1/scrape/{name}` route.
#[derive(Debug, Clone, Serialize)]
pub struct ExtractorInfo {
/// URL-safe identifier (`reddit`, `hackernews`, `github_repo`, ...).
pub name: &'static str,
/// Human-friendly display name.
pub label: &'static str,
/// One-line description of what the extractor returns.
pub description: &'static str,
/// Glob-ish URL pattern(s) the extractor claims. For documentation;
/// the actual matching is done by the extractor's `matches` fn.
pub url_patterns: &'static [&'static str],
}
/// Full catalog. Order is stable; new entries append.
pub fn list() -> Vec<ExtractorInfo> {
vec![
reddit::INFO,
hackernews::INFO,
github_repo::INFO,
github_pr::INFO,
github_issue::INFO,
github_release::INFO,
pypi::INFO,
npm::INFO,
crates_io::INFO,
huggingface_model::INFO,
huggingface_dataset::INFO,
arxiv::INFO,
docker_hub::INFO,
dev_to::INFO,
stackoverflow::INFO,
substack_post::INFO,
youtube_video::INFO,
linkedin_post::INFO,
instagram_post::INFO,
instagram_profile::INFO,
shopify_product::INFO,
shopify_collection::INFO,
ecommerce_product::INFO,
woocommerce_product::INFO,
amazon_product::INFO,
ebay_listing::INFO,
etsy_listing::INFO,
trustpilot_reviews::INFO,
]
}
/// Auto-detect mode: try every extractor's `matches`, return the first
/// one that claims the URL. Used by `/v1/scrape` when the caller doesn't
/// pick a vertical explicitly.
pub async fn dispatch_by_url(
client: &dyn Fetcher,
url: &str,
) -> Option<Result<(&'static str, Value), FetchError>> {
if reddit::matches(url) {
return Some(
reddit::extract(client, url)
.await
.map(|v| (reddit::INFO.name, v)),
);
}
if hackernews::matches(url) {
return Some(
hackernews::extract(client, url)
.await
.map(|v| (hackernews::INFO.name, v)),
);
}
if github_repo::matches(url) {
return Some(
github_repo::extract(client, url)
.await
.map(|v| (github_repo::INFO.name, v)),
);
}
if pypi::matches(url) {
return Some(
pypi::extract(client, url)
.await
.map(|v| (pypi::INFO.name, v)),
);
}
if npm::matches(url) {
return Some(npm::extract(client, url).await.map(|v| (npm::INFO.name, v)));
}
if github_pr::matches(url) {
return Some(
github_pr::extract(client, url)
.await
.map(|v| (github_pr::INFO.name, v)),
);
}
if github_issue::matches(url) {
return Some(
github_issue::extract(client, url)
.await
.map(|v| (github_issue::INFO.name, v)),
);
}
if github_release::matches(url) {
return Some(
github_release::extract(client, url)
.await
.map(|v| (github_release::INFO.name, v)),
);
}
if crates_io::matches(url) {
return Some(
crates_io::extract(client, url)
.await
.map(|v| (crates_io::INFO.name, v)),
);
}
if huggingface_model::matches(url) {
return Some(
huggingface_model::extract(client, url)
.await
.map(|v| (huggingface_model::INFO.name, v)),
);
}
if huggingface_dataset::matches(url) {
return Some(
huggingface_dataset::extract(client, url)
.await
.map(|v| (huggingface_dataset::INFO.name, v)),
);
}
if arxiv::matches(url) {
return Some(
arxiv::extract(client, url)
.await
.map(|v| (arxiv::INFO.name, v)),
);
}
if docker_hub::matches(url) {
return Some(
docker_hub::extract(client, url)
.await
.map(|v| (docker_hub::INFO.name, v)),
);
}
if dev_to::matches(url) {
return Some(
dev_to::extract(client, url)
.await
.map(|v| (dev_to::INFO.name, v)),
);
}
if stackoverflow::matches(url) {
return Some(
stackoverflow::extract(client, url)
.await
.map(|v| (stackoverflow::INFO.name, v)),
);
}
if linkedin_post::matches(url) {
return Some(
linkedin_post::extract(client, url)
.await
.map(|v| (linkedin_post::INFO.name, v)),
);
}
if instagram_post::matches(url) {
return Some(
instagram_post::extract(client, url)
.await
.map(|v| (instagram_post::INFO.name, v)),
);
}
if instagram_profile::matches(url) {
return Some(
instagram_profile::extract(client, url)
.await
.map(|v| (instagram_profile::INFO.name, v)),
);
}
// Antibot-gated verticals with unique hosts: safe to auto-dispatch
// because the matcher can't confuse the URL for anything else. The
// extractor's smart_fetch_html path handles the blocked-without-
// API-key case with a clear actionable error.
if amazon_product::matches(url) {
return Some(
amazon_product::extract(client, url)
.await
.map(|v| (amazon_product::INFO.name, v)),
);
}
if ebay_listing::matches(url) {
return Some(
ebay_listing::extract(client, url)
.await
.map(|v| (ebay_listing::INFO.name, v)),
);
}
if etsy_listing::matches(url) {
return Some(
etsy_listing::extract(client, url)
.await
.map(|v| (etsy_listing::INFO.name, v)),
);
}
if trustpilot_reviews::matches(url) {
return Some(
trustpilot_reviews::extract(client, url)
.await
.map(|v| (trustpilot_reviews::INFO.name, v)),
);
}
if youtube_video::matches(url) {
return Some(
youtube_video::extract(client, url)
.await
.map(|v| (youtube_video::INFO.name, v)),
);
}
// NOTE: shopify_product, shopify_collection, ecommerce_product,
// woocommerce_product, and substack_post are intentionally NOT
// in auto-dispatch. Their `matches()` functions are permissive
// (any URL with `/products/`, `/product/`, `/p/`, etc.) and
// claiming those generically would steal URLs from the default
// `/v1/scrape` markdown flow. Callers opt in via
// `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`.
None
}
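The chain can be modeled as an ordered list of `(name, matcher)` pairs where the first claim wins; order therefore encodes priority. A sync toy version with hypothetical stand-in matchers:

```rust
// Toy model of the dispatch chain: ordered (name, matcher) pairs,
// first claim wins. The matchers here are simplified stand-ins for
// the real per-extractor `matches` functions.
fn dispatch(url: &str) -> Option<&'static str> {
    let chain: &[(&'static str, fn(&str) -> bool)] = &[
        ("reddit", |u| u.contains("reddit.com")),
        ("hackernews", |u| u.contains("news.ycombinator.com")),
        // Permissive matchers (anything with /products/, /p/, ...) are
        // deliberately left out, exactly as in the real chain.
    ];
    chain.iter().find(|(_, m)| m(url)).map(|(name, _)| *name)
}

fn main() {
    assert_eq!(dispatch("https://www.reddit.com/r/rust/"), Some("reddit"));
    // A generic product URL falls through to the default markdown flow.
    assert_eq!(dispatch("https://example.com/products/x"), None);
}
```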
/// Explicit mode: caller picked the vertical (`POST /v1/scrape/reddit`).
/// We still validate that the URL plausibly belongs to that vertical so
/// users get a clear "wrong route" error instead of a confusing parse
/// failure deep in the extractor.
pub async fn dispatch_by_name(
client: &dyn Fetcher,
name: &str,
url: &str,
) -> Result<Value, ExtractorDispatchError> {
match name {
n if n == reddit::INFO.name => {
run_or_mismatch(reddit::matches(url), n, url, || {
reddit::extract(client, url)
})
.await
}
n if n == hackernews::INFO.name => {
run_or_mismatch(hackernews::matches(url), n, url, || {
hackernews::extract(client, url)
})
.await
}
n if n == github_repo::INFO.name => {
run_or_mismatch(github_repo::matches(url), n, url, || {
github_repo::extract(client, url)
})
.await
}
n if n == pypi::INFO.name => {
run_or_mismatch(pypi::matches(url), n, url, || pypi::extract(client, url)).await
}
n if n == npm::INFO.name => {
run_or_mismatch(npm::matches(url), n, url, || npm::extract(client, url)).await
}
n if n == github_pr::INFO.name => {
run_or_mismatch(github_pr::matches(url), n, url, || {
github_pr::extract(client, url)
})
.await
}
n if n == github_issue::INFO.name => {
run_or_mismatch(github_issue::matches(url), n, url, || {
github_issue::extract(client, url)
})
.await
}
n if n == github_release::INFO.name => {
run_or_mismatch(github_release::matches(url), n, url, || {
github_release::extract(client, url)
})
.await
}
n if n == crates_io::INFO.name => {
run_or_mismatch(crates_io::matches(url), n, url, || {
crates_io::extract(client, url)
})
.await
}
n if n == huggingface_model::INFO.name => {
run_or_mismatch(huggingface_model::matches(url), n, url, || {
huggingface_model::extract(client, url)
})
.await
}
n if n == huggingface_dataset::INFO.name => {
run_or_mismatch(huggingface_dataset::matches(url), n, url, || {
huggingface_dataset::extract(client, url)
})
.await
}
n if n == arxiv::INFO.name => {
run_or_mismatch(arxiv::matches(url), n, url, || arxiv::extract(client, url)).await
}
n if n == docker_hub::INFO.name => {
run_or_mismatch(docker_hub::matches(url), n, url, || {
docker_hub::extract(client, url)
})
.await
}
n if n == dev_to::INFO.name => {
run_or_mismatch(dev_to::matches(url), n, url, || {
dev_to::extract(client, url)
})
.await
}
n if n == stackoverflow::INFO.name => {
run_or_mismatch(stackoverflow::matches(url), n, url, || {
stackoverflow::extract(client, url)
})
.await
}
n if n == linkedin_post::INFO.name => {
run_or_mismatch(linkedin_post::matches(url), n, url, || {
linkedin_post::extract(client, url)
})
.await
}
n if n == instagram_post::INFO.name => {
run_or_mismatch(instagram_post::matches(url), n, url, || {
instagram_post::extract(client, url)
})
.await
}
n if n == instagram_profile::INFO.name => {
run_or_mismatch(instagram_profile::matches(url), n, url, || {
instagram_profile::extract(client, url)
})
.await
}
n if n == shopify_product::INFO.name => {
run_or_mismatch(shopify_product::matches(url), n, url, || {
shopify_product::extract(client, url)
})
.await
}
n if n == ecommerce_product::INFO.name => {
run_or_mismatch(ecommerce_product::matches(url), n, url, || {
ecommerce_product::extract(client, url)
})
.await
}
n if n == amazon_product::INFO.name => {
run_or_mismatch(amazon_product::matches(url), n, url, || {
amazon_product::extract(client, url)
})
.await
}
n if n == ebay_listing::INFO.name => {
run_or_mismatch(ebay_listing::matches(url), n, url, || {
ebay_listing::extract(client, url)
})
.await
}
n if n == etsy_listing::INFO.name => {
run_or_mismatch(etsy_listing::matches(url), n, url, || {
etsy_listing::extract(client, url)
})
.await
}
n if n == trustpilot_reviews::INFO.name => {
run_or_mismatch(trustpilot_reviews::matches(url), n, url, || {
trustpilot_reviews::extract(client, url)
})
.await
}
n if n == youtube_video::INFO.name => {
run_or_mismatch(youtube_video::matches(url), n, url, || {
youtube_video::extract(client, url)
})
.await
}
n if n == substack_post::INFO.name => {
run_or_mismatch(substack_post::matches(url), n, url, || {
substack_post::extract(client, url)
})
.await
}
n if n == shopify_collection::INFO.name => {
run_or_mismatch(shopify_collection::matches(url), n, url, || {
shopify_collection::extract(client, url)
})
.await
}
n if n == woocommerce_product::INFO.name => {
run_or_mismatch(woocommerce_product::matches(url), n, url, || {
woocommerce_product::extract(client, url)
})
.await
}
_ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
}
}
/// Errors that the dispatcher itself raises (vs. errors from inside an
/// extractor, which come back wrapped in `Fetch`).
#[derive(Debug, thiserror::Error)]
pub enum ExtractorDispatchError {
#[error("unknown vertical: '{0}'")]
UnknownVertical(String),
#[error("URL '{url}' does not match the '{vertical}' extractor")]
UrlMismatch { vertical: String, url: String },
#[error(transparent)]
Fetch(#[from] FetchError),
}
/// Helper: when the caller explicitly picked a vertical but their URL
/// doesn't match it, return `UrlMismatch` instead of running the
/// extractor (which would just fail with a less-clear error).
async fn run_or_mismatch<F, Fut>(
matches: bool,
vertical: &str,
url: &str,
f: F,
) -> Result<Value, ExtractorDispatchError>
where
F: FnOnce() -> Fut,
Fut: std::future::Future<Output = Result<Value, FetchError>>,
{
if !matches {
return Err(ExtractorDispatchError::UrlMismatch {
vertical: vertical.to_string(),
url: url.to_string(),
});
}
f().await.map_err(ExtractorDispatchError::Fetch)
}
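A sync analogue of the guard (the real helper is generic over a future) makes the control flow plain: the closure only runs once the matcher has already agreed, so extractors never see foreign URLs:

```rust
// Sync analogue of run_or_mismatch: refuse to run the extractor
// closure when the URL does not match the chosen vertical.
fn run_or_mismatch<F>(matches: bool, vertical: &str, url: &str, f: F) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String>,
{
    if !matches {
        return Err(format!(
            "URL '{url}' does not match the '{vertical}' extractor"
        ));
    }
    f()
}

fn main() {
    let ran = run_or_mismatch(true, "npm", "https://www.npmjs.com/package/react", || {
        Ok("extracted".to_string())
    });
    assert_eq!(ran, Ok("extracted".to_string()));

    // The closure never executes on a mismatch.
    let refused = run_or_mismatch(false, "npm", "https://example.com", || {
        panic!("never runs")
    });
    assert!(refused.is_err());
}
```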
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn list_is_non_empty_and_unique() {
let entries = list();
assert!(!entries.is_empty());
let mut names: Vec<_> = entries.iter().map(|e| e.name).collect();
names.sort();
let before = names.len();
names.dedup();
assert_eq!(before, names.len(), "extractor names must be unique");
}
}


@@ -0,0 +1,235 @@
//! npm package structured extractor.
//!
//! Uses two npm-operated APIs:
//! - `registry.npmjs.org/{name}` for full package metadata
//! - `api.npmjs.org/downloads/point/last-week/{name}` for usage signal
//!
//! The registry API returns the *full* document including every version
//! ever published, which can be tens of MB for popular packages
//! (`@types/node`, etc.). We strip down to the latest version's manifest
//! and a count of releases — full history would explode the response.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "npm",
label: "npm package",
description: "Returns package metadata: latest version manifest, dependencies, weekly downloads, license.",
url_patterns: &["https://www.npmjs.com/package/{name}"],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "www.npmjs.com" && host != "npmjs.com" {
return false;
}
url.contains("/package/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let name = parse_name(url)
.ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?;
let registry_url = format!("https://registry.npmjs.org/{}", urlencode_segment(&name));
let resp = client.fetch(&registry_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"npm: package '{name}' not found"
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"npm registry returned status {}",
resp.status
)));
}
let pkg: PackageDoc = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("npm registry parse: {e}")))?;
// Resolve "latest" to a concrete version; when dist-tags is missing,
// fall back to the last (lexicographically greatest) versions key.
let latest_version = pkg
.dist_tags
.as_ref()
.and_then(|t| t.get("latest"))
.cloned()
.or_else(|| pkg.versions.as_ref().and_then(|v| v.keys().last().cloned()));
let latest_manifest = latest_version
.as_deref()
.and_then(|v| pkg.versions.as_ref().and_then(|m| m.get(v)));
let release_count = pkg.versions.as_ref().map(|v| v.len()).unwrap_or(0);
let latest_release_date = latest_version
.as_deref()
.and_then(|v| pkg.time.as_ref().and_then(|t| t.get(v).cloned()));
// Best-effort weekly downloads. If the api.npmjs.org call fails we
// surface `null` rather than failing the whole extractor — npm
// sometimes 503s the downloads endpoint while the registry is up.
let weekly_downloads = fetch_weekly_downloads(client, &name).await.ok();
Ok(json!({
"url": url,
"name": pkg.name.clone().unwrap_or(name.clone()),
"description": pkg.description,
"latest_version": latest_version,
"license": latest_manifest.and_then(|m| m.license.clone()),
"homepage": pkg.homepage,
"repository": pkg.repository.as_ref().and_then(|r| r.url.clone()),
"dependencies": latest_manifest.and_then(|m| m.dependencies.clone()),
"dev_dependencies": latest_manifest.and_then(|m| m.dev_dependencies.clone()),
"peer_dependencies": latest_manifest.and_then(|m| m.peer_dependencies.clone()),
"keywords": pkg.keywords,
"maintainers": pkg.maintainers,
"deprecated": latest_manifest.and_then(|m| m.deprecated.clone()),
"release_count": release_count,
"latest_release_date": latest_release_date,
"weekly_downloads": weekly_downloads,
}))
}
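The lexicographic nature of that fallback is worth spelling out: `BTreeMap` keys sort as strings, not as semver, so `"9.0.0"` sorts after `"10.0.0"`. A sketch with simplified types:

```rust
use std::collections::BTreeMap;

// dist-tags wins; the fallback is the lexicographically last version
// key, which is NOT semver order ("9.0.0" sorts after "10.0.0").
fn resolve_latest(
    dist_tags: Option<&BTreeMap<String, String>>,
    versions: Option<&BTreeMap<String, ()>>,
) -> Option<String> {
    dist_tags
        .and_then(|t| t.get("latest"))
        .cloned()
        .or_else(|| versions.and_then(|v| v.keys().last().cloned()))
}

fn main() {
    let mut versions = BTreeMap::new();
    versions.insert("10.0.0".to_string(), ());
    versions.insert("9.0.0".to_string(), ());
    // Without dist-tags the fallback picks "9.0.0" lexicographically.
    assert_eq!(resolve_latest(None, Some(&versions)), Some("9.0.0".into()));

    let mut tags = BTreeMap::new();
    tags.insert("latest".to_string(), "10.0.0".to_string());
    assert_eq!(
        resolve_latest(Some(&tags), Some(&versions)),
        Some("10.0.0".into())
    );
}
```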
async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result<i64, FetchError> {
let url = format!(
"https://api.npmjs.org/downloads/point/last-week/{}",
urlencode_segment(name)
);
let resp = client.fetch(&url).await?;
if resp.status != 200 {
return Err(FetchError::Build(format!(
"npm downloads api status {}",
resp.status
)));
}
let dl: Downloads = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("npm downloads parse: {e}")))?;
Ok(dl.downloads)
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Extract the package name from an npmjs.com URL. Handles scoped packages
/// (`/package/@scope/name`) and trailing path segments (`/v/x.y.z`).
fn parse_name(url: &str) -> Option<String> {
let after = url.split("/package/").nth(1)?;
let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
let first = segs.next()?;
if first.starts_with('@') {
let second = segs.next()?;
Some(format!("{first}/{second}"))
} else {
Some(first.to_string())
}
}
/// `@scope/name` must encode the `/` for the registry path. Plain names
/// pass through untouched.
fn urlencode_segment(name: &str) -> String {
name.replace('/', "%2F")
}
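Putting `parse_name` and the encoding together for a scoped package (a standalone copy, with the encode inlined as `replace`):

```rust
// Scoped names keep exactly two segments ("@scope/name") and must
// encode the '/' before going into the registry path.
fn parse_name(url: &str) -> Option<String> {
    let after = url.split("/package/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let first = segs.next()?;
    if first.starts_with('@') {
        Some(format!("{first}/{}", segs.next()?))
    } else {
        Some(first.to_string())
    }
}

fn main() {
    // Trailing /v/{version} segments are dropped.
    let name = parse_name("https://www.npmjs.com/package/@types/node/v/20.1.0").unwrap();
    assert_eq!(name, "@types/node");
    // Registry path segment: scope separator percent-encoded.
    assert_eq!(name.replace('/', "%2F"), "@types%2Fnode");
}
```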
// ---------------------------------------------------------------------------
// Registry types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct PackageDoc {
name: Option<String>,
description: Option<String>,
homepage: Option<serde_json::Value>, // sometimes string, sometimes object
repository: Option<Repository>,
keywords: Option<Vec<String>>,
maintainers: Option<Vec<Maintainer>>,
#[serde(rename = "dist-tags")]
dist_tags: Option<std::collections::BTreeMap<String, String>>,
versions: Option<std::collections::BTreeMap<String, VersionManifest>>,
time: Option<std::collections::BTreeMap<String, String>>,
}
#[derive(Deserialize, Default, Clone)]
struct VersionManifest {
license: Option<serde_json::Value>, // string or object
dependencies: Option<std::collections::BTreeMap<String, String>>,
#[serde(rename = "devDependencies")]
dev_dependencies: Option<std::collections::BTreeMap<String, String>>,
#[serde(rename = "peerDependencies")]
peer_dependencies: Option<std::collections::BTreeMap<String, String>>,
// `deprecated` is sometimes a bool and sometimes a string in the
// registry. serde_json::Value covers both without failing the parse.
deprecated: Option<serde_json::Value>,
}
#[derive(Deserialize)]
struct Repository {
url: Option<String>,
}
#[derive(Deserialize, Clone)]
struct Maintainer {
name: Option<String>,
email: Option<String>,
}
impl serde::Serialize for Maintainer {
fn serialize<S: serde::Serializer>(&self, s: S) -> Result<S::Ok, S::Error> {
use serde::ser::SerializeMap;
let mut m = s.serialize_map(Some(2))?;
m.serialize_entry("name", &self.name)?;
m.serialize_entry("email", &self.email)?;
m.end()
}
}
#[derive(Deserialize)]
struct Downloads {
downloads: i64,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_npm_package_urls() {
assert!(matches("https://www.npmjs.com/package/react"));
assert!(matches("https://www.npmjs.com/package/@types/node"));
assert!(matches("https://npmjs.com/package/lodash"));
assert!(!matches("https://www.npmjs.com/"));
assert!(!matches("https://example.com/package/foo"));
}
#[test]
fn parse_name_handles_scoped_and_unscoped() {
assert_eq!(
parse_name("https://www.npmjs.com/package/react"),
Some("react".into())
);
assert_eq!(
parse_name("https://www.npmjs.com/package/@types/node"),
Some("@types/node".into())
);
assert_eq!(
parse_name("https://www.npmjs.com/package/lodash/v/4.17.21"),
Some("lodash".into())
);
}
#[test]
fn urlencode_only_touches_scope_separator() {
assert_eq!(urlencode_segment("react"), "react");
assert_eq!(urlencode_segment("@types/node"), "@types%2Fnode");
}
}
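Putting the helpers above together, here is a standalone sketch of the page-URL → downloads-API derivation. The URL shapes come from the code above; `downloads_url` is a hypothetical name, not part of the module:

```rust
// Standalone sketch: derive the npm downloads API URL from a package
// page URL, mirroring parse_name + urlencode_segment above.
fn downloads_url(page_url: &str) -> Option<String> {
    let after = page_url.split("/package/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let first = segs.next()?;
    // Scoped packages keep `@scope/name`; the `/` must be %-encoded
    // so the registry treats it as one path segment.
    let name = if first.starts_with('@') {
        format!("{first}/{}", segs.next()?)
    } else {
        first.to_string()
    };
    Some(format!(
        "https://api.npmjs.org/downloads/point/last-week/{}",
        name.replace('/', "%2F")
    ))
}
```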


@@ -0,0 +1,184 @@
//! PyPI package structured extractor.
//!
//! PyPI exposes a stable JSON API at `pypi.org/pypi/{name}/json` and
//! a versioned form at `pypi.org/pypi/{name}/{version}/json`. Both
//! return the full release info plus history. No auth, and no rate
//! limits that we'd hit at normal usage.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "pypi",
label: "PyPI package",
description: "Returns package metadata: latest version, dependencies, license, release history.",
url_patterns: &[
"https://pypi.org/project/{name}/",
"https://pypi.org/project/{name}/{version}/",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "pypi.org" && host != "www.pypi.org" {
return false;
}
url.contains("/project/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (name, version) = parse_project(url).ok_or_else(|| {
FetchError::Build(format!("pypi: cannot parse package name from '{url}'"))
})?;
let api_url = match &version {
Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
None => format!("https://pypi.org/pypi/{name}/json"),
};
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"pypi: package '{name}' not found"
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"pypi api returned status {}",
resp.status
)));
}
let pkg: PypiResponse = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("pypi parse: {e}")))?;
let info = pkg.info;
let release_count = pkg.releases.as_ref().map(|r| r.len()).unwrap_or(0);
// Latest release date = max upload time across files in the latest version.
let latest_release_date = pkg
.releases
.as_ref()
.and_then(|map| info.version.as_deref().and_then(|v| map.get(v)))
.and_then(|files| files.iter().filter_map(|f| f.upload_time.clone()).max());
// Drop the long description from the JSON shape — it's frequently a 50KB
// README and bloats responses. Callers who need it can hit /v1/scrape.
Ok(json!({
"url": url,
"name": info.name,
"version": info.version,
"summary": info.summary,
"homepage": info.home_page,
"license": info.license,
"license_classifier": pick_license_classifier(&info.classifiers),
"author": info.author,
"author_email": info.author_email,
"maintainer": info.maintainer,
"requires_python": info.requires_python,
"requires_dist": info.requires_dist,
"keywords": info.keywords,
"classifiers": info.classifiers,
"yanked": info.yanked,
"yanked_reason": info.yanked_reason,
"project_urls": info.project_urls,
"release_count": release_count,
"latest_release_date": latest_release_date,
}))
}
/// PyPI puts the SPDX-ish license under classifiers like
/// `License :: OSI Approved :: Apache Software License`. Surface the most
/// specific one when the `license` field itself is empty/junk.
fn pick_license_classifier(classifiers: &Option<Vec<String>>) -> Option<String> {
classifiers
.as_ref()?
.iter()
.filter(|c| c.starts_with("License ::"))
.max_by_key(|c| c.len())
.cloned()
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn parse_project(url: &str) -> Option<(String, Option<String>)> {
let after = url.split("/project/").nth(1)?;
let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
let mut segs = stripped.split('/').filter(|s| !s.is_empty());
let name = segs.next()?.to_string();
let version = segs.next().map(|v| v.to_string());
Some((name, version))
}
// ---------------------------------------------------------------------------
// PyPI API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct PypiResponse {
info: Info,
releases: Option<std::collections::BTreeMap<String, Vec<File>>>,
}
#[derive(Deserialize)]
struct Info {
name: Option<String>,
version: Option<String>,
summary: Option<String>,
home_page: Option<String>,
license: Option<String>,
author: Option<String>,
author_email: Option<String>,
maintainer: Option<String>,
requires_python: Option<String>,
requires_dist: Option<Vec<String>>,
keywords: Option<String>,
classifiers: Option<Vec<String>>,
yanked: Option<bool>,
yanked_reason: Option<String>,
project_urls: Option<std::collections::BTreeMap<String, String>>,
}
#[derive(Deserialize)]
struct File {
upload_time: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_project_urls() {
assert!(matches("https://pypi.org/project/requests/"));
assert!(matches("https://pypi.org/project/numpy/1.26.0/"));
assert!(!matches("https://pypi.org/"));
assert!(!matches("https://example.com/project/foo"));
}
#[test]
fn parse_project_pulls_name_and_version() {
assert_eq!(
parse_project("https://pypi.org/project/requests/"),
Some(("requests".into(), None))
);
assert_eq!(
parse_project("https://pypi.org/project/numpy/1.26.0/"),
Some(("numpy".into(), Some("1.26.0".into())))
);
assert_eq!(
parse_project("https://pypi.org/project/scikit-learn/?foo=bar"),
Some(("scikit-learn".into(), None))
);
}
}
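The license-classifier heuristic can be sketched standalone; `pick` is a hypothetical name, and the strings follow PyPI's `License :: ...` trove-classifier format:

```rust
// Pick the longest (most specific) `License ::` trove classifier,
// mirroring pick_license_classifier above.
fn pick(classifiers: &[&str]) -> Option<String> {
    classifiers
        .iter()
        .filter(|c| c.starts_with("License ::"))
        .max_by_key(|c| c.len())
        .map(|c| c.to_string())
}
```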


@@ -0,0 +1,234 @@
//! Reddit structured extractor — returns the full post + comment tree
//! as typed JSON via Reddit's `.json` API.
//!
//! The same trick the markdown extractor in `crate::reddit` uses:
//! appending `.json` to any post URL returns the data the new SPA
//! frontend would load client-side. Zero antibot, zero JS rendering.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "reddit",
label: "Reddit thread",
description: "Returns post + nested comment tree with scores, authors, and timestamps.",
url_patterns: &[
"https://www.reddit.com/r/*/comments/*",
"https://reddit.com/r/*/comments/*",
"https://old.reddit.com/r/*/comments/*",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
let is_reddit_host = matches!(
host,
"reddit.com" | "www.reddit.com" | "old.reddit.com" | "np.reddit.com" | "new.reddit.com"
);
is_reddit_host && url.contains("/comments/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let json_url = build_json_url(url);
let resp = client.fetch(&json_url).await?;
if resp.status != 200 {
return Err(FetchError::Build(format!(
"reddit api returned status {}",
resp.status
)));
}
let listings: Vec<Listing> = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("reddit json parse: {e}")))?;
if listings.is_empty() {
return Err(FetchError::BodyDecode("reddit response empty".into()));
}
// First listing = the post (single t3 child).
let post = listings
.first()
.and_then(|l| l.data.children.first())
.filter(|t| t.kind == "t3")
.map(|t| post_json(&t.data))
.unwrap_or(Value::Null);
// Second listing = the comment tree.
let comments: Vec<Value> = listings
.get(1)
.map(|l| l.data.children.iter().filter_map(comment_json).collect())
.unwrap_or_default();
Ok(json!({
"url": url,
"post": post,
"comments": comments,
}))
}
// ---------------------------------------------------------------------------
// JSON shapers
// ---------------------------------------------------------------------------
fn post_json(d: &ThingData) -> Value {
json!({
"id": d.id,
"title": d.title,
"author": d.author,
"subreddit": d.subreddit_name_prefixed,
"permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
"url": d.url_overridden_by_dest,
"is_self": d.is_self,
"selftext": d.selftext,
"score": d.score,
"upvote_ratio": d.upvote_ratio,
"num_comments": d.num_comments,
"created_utc": d.created_utc,
"link_flair_text": d.link_flair_text,
"over_18": d.over_18,
"spoiler": d.spoiler,
"stickied": d.stickied,
"locked": d.locked,
})
}
/// Render a single comment + its reply tree. Returns `None` for non-t1
/// kinds (the trailing `more` placeholder Reddit injects at depth limits).
fn comment_json(thing: &Thing) -> Option<Value> {
if thing.kind != "t1" {
return None;
}
let d = &thing.data;
let replies: Vec<Value> = match &d.replies {
Some(Replies::Listing(l)) => l.data.children.iter().filter_map(comment_json).collect(),
_ => Vec::new(),
};
Some(json!({
"id": d.id,
"author": d.author,
"body": d.body,
"score": d.score,
"created_utc": d.created_utc,
"is_submitter": d.is_submitter,
"stickied": d.stickied,
"depth": d.depth,
"permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
"replies": replies,
}))
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Build the Reddit JSON URL. We keep the original host (`www.reddit.com`
/// or `old.reddit.com` as the caller gave us). Routing through
/// `old.reddit.com` unconditionally looks appealing but that host has
/// stricter UA-based blocking than `www.reddit.com`, while the main
/// host accepts our Chrome-fingerprinted client fine.
fn build_json_url(url: &str) -> String {
let clean = url.split('?').next().unwrap_or(url).trim_end_matches('/');
format!("{clean}.json?raw_json=1")
}
// ---------------------------------------------------------------------------
// Reddit JSON types — only fields we render. Everything else is dropped.
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Listing {
data: ListingData,
}
#[derive(Deserialize)]
struct ListingData {
children: Vec<Thing>,
}
#[derive(Deserialize)]
struct Thing {
kind: String,
data: ThingData,
}
#[derive(Deserialize, Default)]
struct ThingData {
// post (t3)
id: Option<String>,
title: Option<String>,
selftext: Option<String>,
subreddit_name_prefixed: Option<String>,
url_overridden_by_dest: Option<String>,
is_self: Option<bool>,
upvote_ratio: Option<f64>,
num_comments: Option<i64>,
over_18: Option<bool>,
spoiler: Option<bool>,
stickied: Option<bool>,
locked: Option<bool>,
link_flair_text: Option<String>,
// comment (t1)
author: Option<String>,
body: Option<String>,
score: Option<i64>,
created_utc: Option<f64>,
is_submitter: Option<bool>,
depth: Option<i64>,
permalink: Option<String>,
// recursive
replies: Option<Replies>,
}
#[derive(Deserialize)]
#[serde(untagged)]
enum Replies {
Listing(Listing),
#[allow(dead_code)]
Empty(String),
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_reddit_post_urls() {
assert!(matches(
"https://www.reddit.com/r/rust/comments/abc123/some_title/"
));
assert!(matches(
"https://reddit.com/r/rust/comments/abc123/some_title"
));
assert!(matches("https://old.reddit.com/r/rust/comments/abc123/x/"));
}
#[test]
fn rejects_non_post_reddit_urls() {
assert!(!matches("https://www.reddit.com/r/rust"));
assert!(!matches("https://www.reddit.com/user/foo"));
assert!(!matches("https://example.com/r/rust/comments/x"));
}
#[test]
fn json_url_appends_suffix_and_drops_query() {
assert_eq!(
build_json_url("https://www.reddit.com/r/rust/comments/abc/x/?utm=foo"),
"https://www.reddit.com/r/rust/comments/abc/x.json?raw_json=1"
);
}
}
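The nested `replies` shape the extractor emits is easy to linearize on the consumer side. A minimal sketch with a hypothetical `Comment` struct (not the serde types above):

```rust
// Hypothetical mini comment tree matching the nested `replies` shape
// comment_json emits, plus a depth-first flattener.
struct Comment {
    body: String,
    replies: Vec<Comment>,
}

// Walk the tree depth-first, recording (depth, body) pairs in order.
fn flatten(c: &Comment, depth: usize, out: &mut Vec<(usize, String)>) {
    out.push((depth, c.body.clone()));
    for r in &c.replies {
        flatten(r, depth + 1, out);
    }
}
```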


@@ -0,0 +1,242 @@
//! Shopify collection structured extractor.
//!
//! Every Shopify store exposes `/collections/{handle}.json` and
//! `/collections/{handle}/products.json` on the public surface. This
//! extractor hits `.json` (collection metadata) and falls through to
//! `/products.json` for the first page of products. Same caveat as
//! `shopify_product`: stores with Cloudflare in front of the shop
//! will 403 the public path.
//!
//! Explicit-call only (like `shopify_product`). `/collections/{slug}`
//! is a URL shape used by non-Shopify stores too, so auto-dispatch
//! would claim too many URLs.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "shopify_collection",
label: "Shopify collection",
description: "Returns collection metadata + first page of products (handle, title, vendor, price, available) on ANY Shopify store via /collections/{handle}.json + /products.json.",
url_patterns: &[
"https://{shop}/collections/{handle}",
"https://{shop}.myshopify.com/collections/{handle}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
return false;
}
url.contains("/collections/") && !url.ends_with("/collections/")
}
const NON_SHOPIFY_HOSTS: &[&str] = &[
"amazon.com",
"amazon.co.uk",
"amazon.de",
"ebay.com",
"etsy.com",
"walmart.com",
"target.com",
"aliexpress.com",
"huggingface.co", // has /collections/ for models
"github.com",
];
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let (coll_meta_url, coll_products_url) = build_json_urls(url);
// Step 1: collection metadata. Shopify sometimes returns 200 for a
// missing collection; the MetaWrapper parse below catches that case.
let meta_resp = client.fetch(&coll_meta_url).await?;
if meta_resp.status == 404 {
return Err(FetchError::Build(format!(
"shopify_collection: '{url}' not found"
)));
}
if meta_resp.status == 403 {
return Err(FetchError::Build(format!(
"shopify_collection: {coll_meta_url} returned 403. The store has antibot in front of the .json endpoint. Use /v1/scrape/ecommerce_product or api.webclaw.io for this store."
)));
}
if meta_resp.status != 200 {
return Err(FetchError::Build(format!(
"shopify returned status {} for {coll_meta_url}",
meta_resp.status
)));
}
let meta: MetaWrapper = serde_json::from_str(&meta_resp.html).map_err(|e| {
FetchError::BodyDecode(format!(
"shopify_collection: '{url}' didn't return Shopify JSON, likely not a Shopify store ({e})"
))
})?;
// Step 2: first page of products for this collection.
let products = match client.fetch(&coll_products_url).await {
Ok(r) if r.status == 200 => serde_json::from_str::<ProductsWrapper>(&r.html)
.ok()
.map(|pw| pw.products)
.unwrap_or_default(),
_ => Vec::new(),
};
let product_summaries: Vec<Value> = products
.iter()
.map(|p| {
let first_variant = p.variants.first();
json!({
"id": p.id,
"handle": p.handle,
"title": p.title,
"vendor": p.vendor,
"product_type": p.product_type,
"price": first_variant.and_then(|v| v.price.clone()),
"compare_at_price": first_variant.and_then(|v| v.compare_at_price.clone()),
"available": p.variants.iter().any(|v| v.available.unwrap_or(false)),
"variant_count": p.variants.len(),
"image": p.images.first().and_then(|i| i.src.clone()),
"created_at": p.created_at,
"updated_at": p.updated_at,
})
})
.collect();
let c = meta.collection;
Ok(json!({
"url": url,
"meta_json_url": coll_meta_url,
"products_json_url": coll_products_url,
"collection_id": c.id,
"handle": c.handle,
"title": c.title,
"description_html": c.body_html,
"published_at": c.published_at,
"updated_at": c.updated_at,
"sort_order": c.sort_order,
"products_in_page": product_summaries.len(),
"products": product_summaries,
}))
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Build `(collection.json, collection/products.json)` from a user URL.
fn build_json_urls(url: &str) -> (String, String) {
let (path_part, _query_part) = match url.split_once('?') {
Some((a, b)) => (a, Some(b)),
None => (url, None),
};
let clean = path_part.trim_end_matches('/').trim_end_matches(".json");
(
format!("{clean}.json"),
format!("{clean}/products.json?limit=50"),
)
}
// ---------------------------------------------------------------------------
// Shopify collection + product JSON shapes (subsets)
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct MetaWrapper {
collection: Collection,
}
#[derive(Deserialize)]
struct Collection {
id: Option<i64>,
handle: Option<String>,
title: Option<String>,
body_html: Option<String>,
published_at: Option<String>,
updated_at: Option<String>,
sort_order: Option<String>,
}
#[derive(Deserialize)]
struct ProductsWrapper {
#[serde(default)]
products: Vec<ProductSummary>,
}
#[derive(Deserialize)]
struct ProductSummary {
id: Option<i64>,
handle: Option<String>,
title: Option<String>,
vendor: Option<String>,
product_type: Option<String>,
created_at: Option<String>,
updated_at: Option<String>,
#[serde(default)]
variants: Vec<VariantSummary>,
#[serde(default)]
images: Vec<ImageSummary>,
}
#[derive(Deserialize)]
struct VariantSummary {
price: Option<String>,
compare_at_price: Option<String>,
available: Option<bool>,
}
#[derive(Deserialize)]
struct ImageSummary {
src: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_shopify_collection_urls() {
assert!(matches("https://www.allbirds.com/collections/mens"));
assert!(matches(
"https://shop.example.com/collections/new-arrivals?page=2"
));
}
#[test]
fn rejects_non_shopify() {
assert!(!matches("https://github.com/collections/foo"));
assert!(!matches("https://huggingface.co/collections/foo"));
assert!(!matches("https://example.com/"));
assert!(!matches("https://example.com/collections/"));
}
#[test]
fn build_json_urls_derives_both_paths() {
let (meta, products) = build_json_urls("https://shop.example.com/collections/mens");
assert_eq!(meta, "https://shop.example.com/collections/mens.json");
assert_eq!(
products,
"https://shop.example.com/collections/mens/products.json?limit=50"
);
}
#[test]
fn build_json_urls_handles_trailing_slash() {
let (meta, _) = build_json_urls("https://shop.example.com/collections/mens/");
assert_eq!(meta, "https://shop.example.com/collections/mens.json");
}
}
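The extractor above only fetches the first page of products. A sketch of how a caller might page further, assuming (not exercised here) that Shopify's public endpoint accepts `page=` alongside `limit=`:

```rust
// Build the /products.json URL for a given page of a collection.
// `products_page_url` is a hypothetical helper, not in the module above.
fn products_page_url(collection_url: &str, page: u32) -> String {
    let clean = collection_url.trim_end_matches('/');
    format!("{clean}/products.json?limit=50&page={page}")
}
```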


@@ -0,0 +1,318 @@
//! Shopify product structured extractor.
//!
//! Every Shopify store exposes a public JSON endpoint for each product
//! by appending `.json` to the product URL:
//!
//! https://shop.example.com/products/cool-tshirt
//! → https://shop.example.com/products/cool-tshirt.json
//!
//! There are ~4 million Shopify stores. The `.json` endpoint is
//! undocumented but has been stable for 10+ years. When a store puts
//! Cloudflare / antibot in front of the shop, this path can 403 just
//! like any other — for those cases the caller should fall back to
//! `ecommerce_product` (JSON-LD) or the cloud tier.
//!
//! This extractor is **explicit-call only** — it is NOT auto-dispatched
//! from `/v1/scrape` because we cannot tell ahead of time whether an
//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit
//! `/v1/scrape/shopify_product` when they know.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "shopify_product",
label: "Shopify product",
description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.",
url_patterns: &[
"https://{shop}/products/{handle}",
"https://{shop}.myshopify.com/products/{handle}",
],
};
pub fn matches(url: &str) -> bool {
// Any URL whose path contains /products/{something}. We do not
// filter by host — Shopify powers custom-domain stores. The
// extractor's /.json fallback is what confirms Shopify; `matches`
// just says "this is a plausible shape." Still reject obviously
// non-Shopify known hosts to save a failed request.
let host = host_of(url);
if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
return false;
}
url.contains("/products/") && !url.ends_with("/products/")
}
/// Hosts we know are not Shopify — reject so we don't burn a request.
const NON_SHOPIFY_HOSTS: &[&str] = &[
"amazon.com",
"amazon.co.uk",
"amazon.de",
"amazon.fr",
"amazon.it",
"ebay.com",
"etsy.com",
"walmart.com",
"target.com",
"aliexpress.com",
"bestbuy.com",
"wayfair.com",
"homedepot.com",
"github.com", // /products is a marketing page
];
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let json_url = build_json_url(url);
let resp = client.fetch(&json_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"shopify_product: '{url}' not found (got 404 from {json_url})"
)));
}
if resp.status == 403 {
return Err(FetchError::Build(format!(
"shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback."
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"shopify returned status {} for {json_url}",
resp.status
)));
}
let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| {
FetchError::BodyDecode(format!(
"shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})"
))
})?;
let p = body.product;
let variants: Vec<Value> = p
.variants
.iter()
.map(|v| {
json!({
"id": v.id,
"title": v.title,
"sku": v.sku,
"barcode": v.barcode,
"price": v.price,
"compare_at_price": v.compare_at_price,
"available": v.available,
"inventory_quantity": v.inventory_quantity,
"position": v.position,
"weight": v.weight,
"weight_unit": v.weight_unit,
"requires_shipping": v.requires_shipping,
"taxable": v.taxable,
"option1": v.option1,
"option2": v.option2,
"option3": v.option3,
})
})
.collect();
let images: Vec<Value> = p
.images
.iter()
.map(|i| {
json!({
"src": i.src,
"width": i.width,
"height": i.height,
"position": i.position,
"alt": i.alt,
})
})
.collect();
let options: Vec<Value> = p
.options
.iter()
.map(|o| json!({"name": o.name, "values": o.values, "position": o.position}))
.collect();
// Price range + availability summary across variants (the shape
// agents typically want without walking the variants array).
let prices: Vec<f64> = p
.variants
.iter()
.filter_map(|v| v.price.as_deref().and_then(|s| s.parse::<f64>().ok()))
.collect();
let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false));
Ok(json!({
"url": url,
"json_url": json_url,
"product_id": p.id,
"handle": p.handle,
"title": p.title,
"vendor": p.vendor,
"product_type": p.product_type,
"tags": p.tags,
"description_html": p.body_html,
"published_at": p.published_at,
"created_at": p.created_at,
"updated_at": p.updated_at,
"variant_count": variants.len(),
"image_count": images.len(),
"any_available": any_available,
"price_min": prices.iter().copied().reduce(f64::min),
"price_max": prices.iter().copied().reduce(f64::max),
"variants": variants,
"images": images,
"options": options,
}))
}
/// Build the .json path from a product URL. Handles pre-.jsoned URLs,
/// trailing slashes, and query strings.
fn build_json_url(url: &str) -> String {
let (path_part, query_part) = match url.split_once('?') {
Some((a, b)) => (a, Some(b)),
None => (url, None),
};
let clean = path_part.trim_end_matches('/');
let with_json = if clean.ends_with(".json") {
clean.to_string()
} else {
format!("{clean}.json")
};
match query_part {
Some(q) => format!("{with_json}?{q}"),
None => with_json,
}
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
// ---------------------------------------------------------------------------
// Shopify product JSON shape (a subset of the full response)
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Wrapper {
product: Product,
}
#[derive(Deserialize)]
struct Product {
id: Option<i64>,
title: Option<String>,
handle: Option<String>,
vendor: Option<String>,
product_type: Option<String>,
body_html: Option<String>,
published_at: Option<String>,
created_at: Option<String>,
updated_at: Option<String>,
#[serde(default)]
tags: serde_json::Value, // array OR comma-joined string depending on store
#[serde(default)]
variants: Vec<Variant>,
#[serde(default)]
images: Vec<Image>,
#[serde(default)]
options: Vec<Option_>,
}
#[derive(Deserialize)]
struct Variant {
id: Option<i64>,
title: Option<String>,
sku: Option<String>,
barcode: Option<String>,
price: Option<String>,
compare_at_price: Option<String>,
available: Option<bool>,
inventory_quantity: Option<i64>,
position: Option<i64>,
weight: Option<f64>,
weight_unit: Option<String>,
requires_shipping: Option<bool>,
taxable: Option<bool>,
option1: Option<String>,
option2: Option<String>,
option3: Option<String>,
}
#[derive(Deserialize)]
struct Image {
src: Option<String>,
width: Option<i64>,
height: Option<i64>,
position: Option<i64>,
alt: Option<String>,
}
#[derive(Deserialize)]
#[serde(rename_all = "lowercase")]
struct Option_ {
name: Option<String>,
position: Option<i64>,
#[serde(default)]
values: Vec<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_plausible_shopify_urls() {
assert!(matches(
"https://www.allbirds.com/products/mens-tree-runners"
));
assert!(matches(
"https://shop.example.com/products/cool-tshirt?variant=123"
));
assert!(matches("https://somestore.myshopify.com/products/thing-1"));
}
#[test]
fn rejects_known_non_shopify() {
assert!(!matches("https://www.amazon.com/dp/B0C123"));
assert!(!matches("https://www.etsy.com/listing/12345/foo"));
assert!(!matches("https://www.amazon.co.uk/products/thing"));
assert!(!matches("https://github.com/products"));
}
#[test]
fn rejects_non_product_urls() {
assert!(!matches("https://example.com/"));
assert!(!matches("https://example.com/products/"));
assert!(!matches("https://example.com/collections/all"));
}
#[test]
fn build_json_url_handles_slash_and_query() {
assert_eq!(
build_json_url("https://shop.example.com/products/foo"),
"https://shop.example.com/products/foo.json"
);
assert_eq!(
build_json_url("https://shop.example.com/products/foo/"),
"https://shop.example.com/products/foo.json"
);
assert_eq!(
build_json_url("https://shop.example.com/products/foo?variant=123"),
"https://shop.example.com/products/foo.json?variant=123"
);
assert_eq!(
build_json_url("https://shop.example.com/products/foo.json"),
"https://shop.example.com/products/foo.json"
);
}
}
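A standalone sketch of the variant price-range summary, using `Iterator::reduce`, which yields `None` for an empty iterator and therefore serializes to JSON null; `price_range` is a hypothetical name:

```rust
// Min/max over variant prices; empty input → (None, None).
fn price_range(prices: &[f64]) -> (Option<f64>, Option<f64>) {
    (
        prices.iter().copied().reduce(f64::min),
        prices.iter().copied().reduce(f64::max),
    )
}
```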


@@ -0,0 +1,216 @@
//! Stack Overflow Q&A structured extractor.
//!
//! Uses the Stack Exchange API at `api.stackexchange.com/2.3/questions/{id}`
//! with `site=stackoverflow`. Two calls: one for the question, one for
//! its answers. Both come pre-filtered to include the rendered HTML body
//! so we don't re-parse the question page itself.
//!
//! Anonymous access caps at 300 requests per IP per day. Production
//! cloud should set `STACKAPPS_KEY` to lift the cap to 10,000/day, but
//! we don't require it to work out of the box.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "stackoverflow",
label: "Stack Overflow Q&A",
description: "Returns question + answers: title, body, tags, votes, accepted answer, top answers.",
url_patterns: &["https://stackoverflow.com/questions/{id}/{slug}"],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host != "stackoverflow.com" && host != "www.stackoverflow.com" {
return false;
}
parse_question_id(url).is_some()
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let id = parse_question_id(url).ok_or_else(|| {
FetchError::Build(format!(
"stackoverflow: cannot parse question id from '{url}'"
))
})?;
// Filter `withbody` includes the rendered HTML body for both questions
// and answers. Stack Exchange's filter system is documented at
// api.stackexchange.com/docs/filters.
let q_url = format!(
"https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody"
);
let q_resp = client.fetch(&q_url).await?;
if q_resp.status != 200 {
return Err(FetchError::Build(format!(
"stackexchange api returned status {}",
q_resp.status
)));
}
let q_body: QResponse = serde_json::from_str(&q_resp.html)
.map_err(|e| FetchError::BodyDecode(format!("stackoverflow q parse: {e}")))?;
let q = q_body
.items
.first()
.ok_or_else(|| FetchError::Build(format!("stackoverflow: question {id} not found")))?;
let a_url = format!(
"https://api.stackexchange.com/2.3/questions/{id}/answers?site=stackoverflow&filter=withbody&order=desc&sort=votes"
);
let a_resp = client.fetch(&a_url).await?;
let answers = if a_resp.status == 200 {
let a_body: AResponse = serde_json::from_str(&a_resp.html)
.map_err(|e| FetchError::BodyDecode(format!("stackoverflow a parse: {e}")))?;
a_body
.items
.iter()
.map(|a| {
json!({
"answer_id": a.answer_id,
"is_accepted": a.is_accepted,
"score": a.score,
"body": a.body,
"creation_date": a.creation_date,
"last_edit_date": a.last_edit_date,
"author": a.owner.as_ref().and_then(|o| o.display_name.clone()),
"author_rep": a.owner.as_ref().and_then(|o| o.reputation),
})
})
.collect::<Vec<_>>()
} else {
Vec::new()
};
let accepted = answers
.iter()
.find(|a| {
a.get("is_accepted")
.and_then(|v| v.as_bool())
.unwrap_or(false)
})
.cloned();
Ok(json!({
"url": url,
"question_id": q.question_id,
"title": q.title,
"body": q.body,
"tags": q.tags,
"score": q.score,
"view_count": q.view_count,
"answer_count": q.answer_count,
"is_answered": q.is_answered,
"accepted_answer_id": q.accepted_answer_id,
"creation_date": q.creation_date,
"last_activity_date": q.last_activity_date,
"author": q.owner.as_ref().and_then(|o| o.display_name.clone()),
"author_rep": q.owner.as_ref().and_then(|o| o.reputation),
"link": q.link,
"accepted_answer": accepted,
"top_answers": answers,
}))
}
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Parse question id from a URL of the form `/questions/{id}/{slug}`.
fn parse_question_id(url: &str) -> Option<u64> {
let after = url.split("/questions/").nth(1)?;
let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
let first = stripped.split('/').next()?;
first.parse::<u64>().ok()
}
// ---------------------------------------------------------------------------
// Stack Exchange API types
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct QResponse {
#[serde(default)]
items: Vec<Question>,
}
#[derive(Deserialize)]
struct Question {
question_id: Option<u64>,
title: Option<String>,
body: Option<String>,
#[serde(default)]
tags: Vec<String>,
score: Option<i64>,
view_count: Option<i64>,
answer_count: Option<i64>,
is_answered: Option<bool>,
accepted_answer_id: Option<u64>,
creation_date: Option<i64>,
last_activity_date: Option<i64>,
owner: Option<Owner>,
link: Option<String>,
}
#[derive(Deserialize)]
struct AResponse {
#[serde(default)]
items: Vec<Answer>,
}
#[derive(Deserialize)]
struct Answer {
answer_id: Option<u64>,
is_accepted: Option<bool>,
score: Option<i64>,
body: Option<String>,
creation_date: Option<i64>,
last_edit_date: Option<i64>,
owner: Option<Owner>,
}
#[derive(Deserialize)]
struct Owner {
display_name: Option<String>,
reputation: Option<i64>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_question_urls() {
assert!(matches(
"https://stackoverflow.com/questions/12345/some-slug"
));
assert!(matches(
"https://stackoverflow.com/questions/12345/some-slug?answertab=votes"
));
assert!(!matches("https://stackoverflow.com/"));
assert!(!matches("https://stackoverflow.com/questions"));
assert!(!matches("https://stackoverflow.com/users/100"));
assert!(!matches("https://example.com/questions/12345/x"));
}
#[test]
fn parse_question_id_handles_slug_and_query() {
assert_eq!(
parse_question_id("https://stackoverflow.com/questions/12345/some-slug"),
Some(12345)
);
assert_eq!(
parse_question_id("https://stackoverflow.com/questions/12345/some-slug?tab=newest"),
Some(12345)
);
assert_eq!(parse_question_id("https://stackoverflow.com/foo"), None);
}
}


@ -0,0 +1,565 @@
//! Substack post extractor.
//!
//! Every Substack publication exposes a `/api/v1/posts/{slug}` endpoint
//! that returns the full post as JSON: body HTML, cover image, author,
//! publication info, reactions, paywall state. No auth is needed on
//! public posts.
//!
//! Works on both `*.substack.com` subdomains and custom domains
//! (publications that serve Substack under their own hostname).
//! Detection is "URL has `/p/{slug}`" because that's the canonical
//! Substack post path. Explicit-call only, because the `/p/{slug}` URL
//! shape is used by non-Substack sites too.
//!
//! ## Fallback
//!
//! The API endpoint is rate-limited aggressively on popular publications
//! and occasionally returns 403 on custom domains with Cloudflare in
//! front. When that happens we escalate to an HTML fetch (via
//! `smart_fetch_html`, so antibot-protected custom domains still work)
//! and extract OG tags + Article JSON-LD for a degraded-but-useful
//! payload. The response shape stays stable across both paths; a
//! `data_source` field tells the caller which branch ran.
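//!
//! The status dispatch in `extract` reduces to (sketch; the shapes
//! mirror the match arms below):
//!
//! ```text
//! GET {scheme}://{host}/api/v1/posts/{slug}
//!   200 + Substack JSON  -> full "api" payload
//!   200 + other body     -> HTML fallback (OG + JSON-LD)
//!   404                  -> hard error: post not found
//!   any other status     -> HTML fallback
//! ```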
use std::sync::OnceLock;
use regex::Regex;
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "substack_post",
label: "Substack post",
description: "Returns post HTML, title, subtitle, author, publication, reactions, paywall status via the Substack public API. Falls back to OG + JSON-LD HTML parsing when the API is rate-limited.",
url_patterns: &[
"https://{pub}.substack.com/p/{slug}",
"https://{custom-domain}/p/{slug}",
],
};
pub fn matches(url: &str) -> bool {
if !(url.starts_with("http://") || url.starts_with("https://")) {
return false;
}
url.contains("/p/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let slug = parse_slug(url).ok_or_else(|| {
FetchError::Build(format!("substack_post: cannot parse slug from '{url}'"))
})?;
let host = host_of(url);
if host.is_empty() {
return Err(FetchError::Build(format!(
"substack_post: empty host in '{url}'"
)));
}
let scheme = if url.starts_with("http://") {
"http"
} else {
"https"
};
let api_url = format!("{scheme}://{host}/api/v1/posts/{slug}");
// 1. Try the public API. 200 = full payload; 404 = real miss; any
// other status hands off to the HTML fallback so a transient rate
// limit or a hardened custom domain doesn't fail the whole call.
let resp = client.fetch(&api_url).await?;
match resp.status {
200 => match serde_json::from_str::<Post>(&resp.html) {
Ok(p) => Ok(build_api_payload(url, &api_url, &slug, p)),
Err(e) => {
// API returned 200 but the body isn't the Post shape we
// expect. Could be a custom-domain site that exposes
// something else at /api/v1/posts/. Fall back to HTML
// rather than hard-failing.
html_fallback(
client,
url,
&api_url,
&slug,
Some(format!(
"api returned 200 but body was not Substack JSON ({e})"
)),
)
.await
}
},
404 => Err(FetchError::Build(format!(
"substack_post: '{slug}' not found on {host} (got 404). \
If the publication isn't actually on Substack, use /v1/scrape instead."
))),
_ => {
// Rate limit, 403, 5xx, whatever: try HTML.
let reason = format!("api returned status {} for {api_url}", resp.status);
html_fallback(client, url, &api_url, &slug, Some(reason)).await
}
}
}
// ---------------------------------------------------------------------------
// API-path payload builder
// ---------------------------------------------------------------------------
fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value {
json!({
"url": url,
"api_url": api_url,
"data_source": "api",
"id": p.id,
"type": p.r#type,
"slug": p.slug.or_else(|| Some(slug.to_string())),
"title": p.title,
"subtitle": p.subtitle,
"description": p.description,
"canonical_url": p.canonical_url,
"post_date": p.post_date,
"updated_at": p.updated_at,
"audience": p.audience,
"has_paywall": matches!(p.audience.as_deref(), Some("only_paid") | Some("founding")),
"is_free_preview": p.is_free_preview,
"cover_image": p.cover_image,
"word_count": p.wordcount,
"reactions": p.reactions,
"comment_count": p.comment_count,
"body_html": p.body_html,
"body_text": p.truncated_body_text.or(p.body_text),
"publication": json!({
"id": p.publication.as_ref().and_then(|pub_| pub_.id),
"name": p.publication.as_ref().and_then(|pub_| pub_.name.clone()),
"subdomain": p.publication.as_ref().and_then(|pub_| pub_.subdomain.clone()),
"custom_domain":p.publication.as_ref().and_then(|pub_| pub_.custom_domain.clone()),
}),
"authors": p.published_bylines.iter().map(|a| json!({
"id": a.id,
"name": a.name,
"handle": a.handle,
"photo": a.photo_url,
})).collect::<Vec<_>>(),
})
}
// ---------------------------------------------------------------------------
// HTML fallback: OG + Article JSON-LD
// ---------------------------------------------------------------------------
async fn html_fallback(
client: &dyn Fetcher,
url: &str,
api_url: &str,
slug: &str,
fallback_reason: Option<String>,
) -> Result<Value, FetchError> {
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
let mut data = parse_html(&fetched.html, url, api_url, slug);
if let Some(obj) = data.as_object_mut() {
obj.insert(
"fetch_source".into(),
match fetched.source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
if let Some(reason) = fallback_reason {
obj.insert("fallback_reason".into(), json!(reason));
}
}
Ok(data)
}
/// Pure HTML parser. Pulls title, subtitle, description, cover image,
/// publish date, and authors from OG tags and Article JSON-LD. Kept
/// public so tests can exercise it with fixtures.
pub fn parse_html(html: &str, url: &str, api_url: &str, slug: &str) -> Value {
let article = find_article_jsonld(html);
let title = article
.as_ref()
.and_then(|v| get_text(v, "headline"))
.or_else(|| og(html, "title"));
let description = article
.as_ref()
.and_then(|v| get_text(v, "description"))
.or_else(|| og(html, "description"));
let cover_image = article
.as_ref()
.and_then(get_first_image)
.or_else(|| og(html, "image"));
let post_date = article
.as_ref()
.and_then(|v| get_text(v, "datePublished"))
.or_else(|| meta_property(html, "article:published_time"));
let updated_at = article.as_ref().and_then(|v| get_text(v, "dateModified"));
let publication_name = og(html, "site_name");
let authors = article.as_ref().map(extract_authors).unwrap_or_default();
json!({
"url": url,
"api_url": api_url,
"data_source": "html_fallback",
"slug": slug,
"title": title,
"subtitle": None::<String>,
"description": description,
"canonical_url": canonical_url(html).or_else(|| Some(url.to_string())),
"post_date": post_date,
"updated_at": updated_at,
"cover_image": cover_image,
"body_html": None::<String>,
"body_text": None::<String>,
"word_count": None::<i64>,
"comment_count": None::<i64>,
"reactions": Value::Null,
"has_paywall": None::<bool>,
"is_free_preview": None::<bool>,
"publication": json!({
"name": publication_name,
}),
"authors": authors,
})
}
fn extract_authors(v: &Value) -> Vec<Value> {
let Some(a) = v.get("author") else {
return Vec::new();
};
let one = |val: &Value| -> Option<Value> {
match val {
Value::String(s) => Some(json!({"name": s})),
Value::Object(_) => {
let name = val.get("name").and_then(|n| n.as_str())?;
let handle = val
.get("url")
.and_then(|u| u.as_str())
.and_then(handle_from_author_url);
Some(json!({
"name": name,
"handle": handle,
}))
}
_ => None,
}
};
match a {
Value::Array(arr) => arr.iter().filter_map(one).collect(),
_ => one(a).into_iter().collect(),
}
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
fn parse_slug(url: &str) -> Option<String> {
let after = url.split("/p/").nth(1)?;
let stripped = after
.split(['?', '#'])
.next()?
.trim_end_matches('/')
.split('/')
.next()
.unwrap_or("");
if stripped.is_empty() {
None
} else {
Some(stripped.to_string())
}
}
/// Extract the Substack handle from an author URL like
/// `https://substack.com/@handle` or `https://pub.substack.com/@handle`.
///
/// Returns `None` when the URL has no `@` segment (e.g. a non-Substack
/// author page) so we don't synthesise a fake handle.
fn handle_from_author_url(u: &str) -> Option<String> {
let after = u.rsplit_once('@').map(|(_, tail)| tail)?;
let clean = after.split(['/', '?', '#']).next()?;
if clean.is_empty() {
None
} else {
Some(clean.to_string())
}
}
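Both URL helpers above are plain string slicing; restated inline (pure std) so the sketch runs standalone:

```rust
// Illustrative restatement of the slug/handle helpers above: take the
// segment after "/p/" or "@", then strip query, fragment, and slashes.
fn parse_slug(url: &str) -> Option<String> {
    let after = url.split("/p/").nth(1)?;
    let s = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if s.is_empty() { None } else { Some(s.to_string()) }
}

fn handle_from_author_url(u: &str) -> Option<String> {
    let tail = u.rsplit_once('@').map(|(_, t)| t)?;
    let clean = tail.split(['/', '?', '#']).next()?;
    if clean.is_empty() { None } else { Some(clean.to_string()) }
}
```

A bare `/p/` with no slug yields `None`, as does an author URL without an `@` segment.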
// ---------------------------------------------------------------------------
// HTML tag helpers
// ---------------------------------------------------------------------------
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
/// Pull `<meta property="article:published_time" content="...">` and
/// similar structured meta tags.
fn meta_property(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
fn canonical_url(html: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE
.get_or_init(|| Regex::new(r#"(?i)<link[^>]+rel="canonical"[^>]+href="([^"]+)""#).unwrap());
re.captures(html)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
// ---------------------------------------------------------------------------
// JSON-LD walkers (Article / NewsArticle)
// ---------------------------------------------------------------------------
fn find_article_jsonld(html: &str) -> Option<Value> {
let blocks = webclaw_core::structured_data::extract_json_ld(html);
for b in blocks {
if let Some(found) = find_article_in(&b) {
return Some(found);
}
}
None
}
fn find_article_in(v: &Value) -> Option<Value> {
if is_article_type(v) {
return Some(v.clone());
}
if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
for item in graph {
if let Some(found) = find_article_in(item) {
return Some(found);
}
}
}
if let Some(arr) = v.as_array() {
for item in arr {
if let Some(found) = find_article_in(item) {
return Some(found);
}
}
}
None
}
fn is_article_type(v: &Value) -> bool {
let Some(t) = v.get("@type") else {
return false;
};
let is_art = |s: &str| {
matches!(
s,
"Article" | "NewsArticle" | "BlogPosting" | "SocialMediaPosting"
)
};
match t {
Value::String(s) => is_art(s),
Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_art)),
_ => false,
}
}
fn get_text(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Number(n) => Some(n.to_string()),
_ => None,
})
}
fn get_first_image(v: &Value) -> Option<String> {
match v.get("image")? {
Value::String(s) => Some(s.clone()),
Value::Array(arr) => arr.iter().find_map(|x| match x {
Value::String(s) => Some(s.clone()),
Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}),
Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
_ => None,
}
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
// ---------------------------------------------------------------------------
// Substack API types (subset)
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Post {
id: Option<i64>,
r#type: Option<String>,
slug: Option<String>,
title: Option<String>,
subtitle: Option<String>,
description: Option<String>,
canonical_url: Option<String>,
post_date: Option<String>,
updated_at: Option<String>,
audience: Option<String>,
is_free_preview: Option<bool>,
cover_image: Option<String>,
wordcount: Option<i64>,
reactions: Option<serde_json::Value>,
comment_count: Option<i64>,
body_html: Option<String>,
body_text: Option<String>,
truncated_body_text: Option<String>,
publication: Option<Publication>,
#[serde(default, rename = "publishedBylines")]
published_bylines: Vec<Byline>,
}
#[derive(Deserialize)]
struct Publication {
id: Option<i64>,
name: Option<String>,
subdomain: Option<String>,
custom_domain: Option<String>,
}
#[derive(Deserialize)]
struct Byline {
id: Option<i64>,
name: Option<String>,
handle: Option<String>,
photo_url: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_post_urls() {
assert!(matches(
"https://stratechery.substack.com/p/the-tech-letter"
));
assert!(matches("https://simonwillison.net/p/2024-08-01-something"));
assert!(!matches("https://example.com/"));
assert!(!matches("ftp://example.com/p/foo"));
}
#[test]
fn parse_slug_strips_query_and_trailing_slash() {
assert_eq!(
parse_slug("https://example.substack.com/p/my-post"),
Some("my-post".into())
);
assert_eq!(
parse_slug("https://example.substack.com/p/my-post/"),
Some("my-post".into())
);
assert_eq!(
parse_slug("https://example.substack.com/p/my-post?ref=123"),
Some("my-post".into())
);
}
#[test]
fn parse_html_extracts_from_og_tags() {
let html = r##"
<html><head>
<meta property="og:title" content="My Great Post">
<meta property="og:description" content="A short summary.">
<meta property="og:image" content="https://cdn.substack.com/cover.jpg">
<meta property="og:site_name" content="My Publication">
<meta property="article:published_time" content="2025-09-01T10:00:00Z">
<link rel="canonical" href="https://mypub.substack.com/p/my-post">
</head></html>"##;
let v = parse_html(
html,
"https://mypub.substack.com/p/my-post",
"https://mypub.substack.com/api/v1/posts/my-post",
"my-post",
);
assert_eq!(v["data_source"], "html_fallback");
assert_eq!(v["title"], "My Great Post");
assert_eq!(v["description"], "A short summary.");
assert_eq!(v["cover_image"], "https://cdn.substack.com/cover.jpg");
assert_eq!(v["post_date"], "2025-09-01T10:00:00Z");
assert_eq!(v["publication"]["name"], "My Publication");
assert_eq!(v["canonical_url"], "https://mypub.substack.com/p/my-post");
}
#[test]
fn parse_html_prefers_jsonld_when_present() {
let html = r##"
<html><head>
<meta property="og:title" content="OG Title">
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"NewsArticle",
"headline":"JSON-LD Title",
"description":"JSON-LD desc.",
"image":"https://cdn.substack.com/hero.jpg",
"datePublished":"2025-10-12T08:30:00Z",
"dateModified":"2025-10-12T09:00:00Z",
"author":[{"@type":"Person","name":"Alice Author","url":"https://substack.com/@alice"}]}
</script>
</head></html>"##;
let v = parse_html(
html,
"https://example.com/p/a",
"https://example.com/api/v1/posts/a",
"a",
);
assert_eq!(v["title"], "JSON-LD Title");
assert_eq!(v["description"], "JSON-LD desc.");
assert_eq!(v["cover_image"], "https://cdn.substack.com/hero.jpg");
assert_eq!(v["post_date"], "2025-10-12T08:30:00Z");
assert_eq!(v["updated_at"], "2025-10-12T09:00:00Z");
assert_eq!(v["authors"][0]["name"], "Alice Author");
assert_eq!(v["authors"][0]["handle"], "alice");
}
#[test]
fn handle_from_author_url_pulls_handle() {
assert_eq!(
handle_from_author_url("https://substack.com/@alice"),
Some("alice".into())
);
assert_eq!(
handle_from_author_url("https://mypub.substack.com/@bob/"),
Some("bob".into())
);
assert_eq!(
handle_from_author_url("https://not-substack.com/author/carol"),
None
);
}
}


@ -0,0 +1,572 @@
//! Trustpilot company reviews extractor.
//!
//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
//! "Verifying your connection" interstitial, so this extractor always
//! routes through [`cloud::smart_fetch_html`]. Without
//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
//! "set API key" error; with one it escalates to api.webclaw.io.
//!
//! ## 2025 JSON-LD schema
//!
//! Trustpilot replaced the old single-Organization + aggregateRating
//! shape with three separate JSON-LD blocks:
//!
//! 1. `Organization` block for Trustpilot the platform itself
//! (company info, addresses, social profiles). Not the business
//! being reviewed. We detect and skip this.
//! 2. `Dataset` block with a csvw:Table mainEntity that contains the
//! per-star-bucket counts for the target business plus a Total
//! column. The Dataset's `name` is the business display name.
//! 3. `aiSummary` + `aiSummaryReviews` block: the AI-generated
//! summary of reviews plus the individual review objects
//! (consumer, dates, rating, title, text, language, likes).
//!
//! The page head adds extra signal: the OG title parses as
//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
//! the OG description carries `"{N} customers have already said"`.
//! We fall back to both when the Dataset block is absent.
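//!
//! Trimmed to the keys the parser below actually reads, the Dataset
//! block nests as (sketch):
//!
//! ```text
//! { "@type": "Dataset",
//!   "name": "<business display name>",
//!   "about": { "@id": ".../Organization/<domain>" },
//!   "mainEntity": { "csvw:tableSchema": { "csvw:columns": [
//!     { "csvw:name": "1 star",
//!       "csvw:cells": [ { "csvw:value": "12", "csvw:notes": ["5%"] } ] },
//!     ...
//!   ] } } }
//! ```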
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "trustpilot_reviews",
label: "Trustpilot reviews",
description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
url_patterns: &["https://www.trustpilot.com/review/{domain}"],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if !matches!(host, "www.trustpilot.com" | "trustpilot.com") {
return false;
}
url.contains("/review/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
.await
.map_err(cloud_to_fetch_err)?;
let mut data = parse(&fetched.html, url)?;
if let Some(obj) = data.as_object_mut() {
obj.insert(
"data_source".into(),
match fetched.source {
cloud::FetchSource::Local => json!("local"),
cloud::FetchSource::Cloud => json!("cloud"),
},
);
}
Ok(data)
}
/// Pure parser. Kept public so the cloud pipeline can reuse it on its
/// own fetched HTML without going through the async extract path.
pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
let domain = parse_review_domain(url).ok_or_else(|| {
FetchError::Build(format!(
"trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
))
})?;
let blocks = webclaw_core::structured_data::extract_json_ld(html);
// The business Dataset block has `about.@id` pointing to the target
// domain's Organization (e.g. `.../Organization/anthropic.com`).
let dataset = find_business_dataset(&blocks, &domain);
// The aiSummary block: not typed (no `@type`), detect by key.
let ai_block = find_ai_summary_block(&blocks);
// Business name: Dataset > metadata.title regex > URL domain.
let business_name = dataset
.as_ref()
.and_then(|d| get_string(d, "name"))
.or_else(|| parse_name_from_og_title(html))
.or_else(|| Some(domain.clone()));
// Rating distribution from the csvw:Table columns. Each column has
// csvw:name like "1 star" / "Total" and a single cell with the
// integer count.
let distribution = dataset.as_ref().and_then(parse_star_distribution);
let (rating_from_dist, total_from_dist) = distribution
.as_ref()
.map(compute_rating_stats)
.unwrap_or((None, None));
// Page-title / page-description fallbacks. OG title format:
// "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
let total_from_desc = parse_review_count_from_og_description(html);
// Recent reviews carried by the aiSummary block.
let recent_reviews: Vec<Value> = ai_block
.as_ref()
.and_then(|a| a.get("aiSummaryReviews"))
.and_then(|arr| arr.as_array())
.map(|arr| arr.iter().map(extract_review).collect())
.unwrap_or_default();
let ai_summary = ai_block
.as_ref()
.and_then(|a| a.get("aiSummary"))
.and_then(|s| s.get("summary"))
.and_then(|t| t.as_str())
.map(String::from);
Ok(json!({
"url": url,
"domain": domain,
"business_name": business_name,
"rating_label": rating_label,
"average_rating": rating_from_dist.or(rating_from_og),
"review_count": total_from_dist.or(total_from_desc),
"rating_distribution": distribution,
"ai_summary": ai_summary,
"recent_reviews": recent_reviews,
"review_count_listed": recent_reviews.len(),
}))
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
FetchError::Build(e.to_string())
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Pull the target domain from `trustpilot.com/review/{domain}`.
fn parse_review_domain(url: &str) -> Option<String> {
let after = url.split("/review/").nth(1)?;
let stripped = after
.split(['?', '#'])
.next()?
.trim_end_matches('/')
.split('/')
.next()
.unwrap_or("");
if stripped.is_empty() {
None
} else {
Some(stripped.to_string())
}
}
// ---------------------------------------------------------------------------
// JSON-LD block walkers
// ---------------------------------------------------------------------------
/// Find the Dataset block whose `about.@id` references the target
/// domain's Organization. Falls through to any Dataset if the @id
/// check doesn't match (Trustpilot occasionally varies the URL).
fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
let mut fallback_any_dataset: Option<Value> = None;
for block in blocks {
for node in walk_graph(block) {
if !is_dataset(&node) {
continue;
}
if dataset_about_matches_domain(&node, domain) {
return Some(node);
}
if fallback_any_dataset.is_none() {
fallback_any_dataset = Some(node);
}
}
}
fallback_any_dataset
}
fn is_dataset(v: &Value) -> bool {
v.get("@type")
.and_then(|t| t.as_str())
.is_some_and(|s| s == "Dataset")
}
fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
let about_id = v
.get("about")
.and_then(|a| a.get("@id"))
.and_then(|id| id.as_str());
let Some(id) = about_id else {
return false;
};
id.contains(&format!("/Organization/{domain}"))
}
/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
/// presence of the `aiSummary` key.
fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
for block in blocks {
for node in walk_graph(block) {
if node.get("aiSummary").is_some() {
return Some(node);
}
}
}
None
}
/// Flatten each block (and its `@graph`) into a list of nodes we can
/// iterate over. Handles both `@graph: [ ... ]` (array) and
/// `@graph: { ... }` (single object) shapes — Trustpilot uses both.
fn walk_graph(block: &Value) -> Vec<Value> {
let mut out = vec![block.clone()];
if let Some(graph) = block.get("@graph") {
match graph {
Value::Array(arr) => out.extend(arr.iter().cloned()),
Value::Object(_) => out.push(graph.clone()),
_ => {}
}
}
out
}
// ---------------------------------------------------------------------------
// Rating distribution (csvw:Table)
// ---------------------------------------------------------------------------
/// Parse the per-star distribution from the Dataset block. Returns
/// `{"1_star": {count, percent}, ..., "total": {count, percent}}`.
fn parse_star_distribution(dataset: &Value) -> Option<Value> {
let columns = dataset
.get("mainEntity")?
.get("csvw:tableSchema")?
.get("csvw:columns")?
.as_array()?;
let mut out = serde_json::Map::new();
for col in columns {
let name = col.get("csvw:name").and_then(|n| n.as_str())?;
let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
let count = cell
.get("csvw:value")
.and_then(|v| v.as_str())
.and_then(|s| s.parse::<i64>().ok());
let percent = cell
.get("csvw:notes")
.and_then(|n| n.as_array())
.and_then(|arr| arr.first())
.and_then(|s| s.as_str())
.map(String::from);
let key = normalise_star_key(name);
out.insert(
key,
json!({
"count": count,
"percent": percent,
}),
);
}
if out.is_empty() {
None
} else {
Some(Value::Object(out))
}
}
/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
/// the raw "1 star" key which fights YAML/JS property access.
fn normalise_star_key(name: &str) -> String {
let trimmed = name.trim().to_lowercase();
match trimmed.as_str() {
"1 star" => "one_star".into(),
"2 stars" => "two_stars".into(),
"3 stars" => "three_stars".into(),
"4 stars" => "four_stars".into(),
"5 stars" => "five_stars".into(),
"total" => "total".into(),
other => other.replace(' ', "_"),
}
}
/// Compute average rating (weighted by bucket) and total count from the
/// parsed distribution. Returns `(average, total)`.
fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
let Some(obj) = distribution.as_object() else {
return (None, None);
};
let get_count = |key: &str| -> i64 {
obj.get(key)
.and_then(|v| v.get("count"))
.and_then(|v| v.as_i64())
.unwrap_or(0)
};
let one = get_count("one_star");
let two = get_count("two_stars");
let three = get_count("three_stars");
let four = get_count("four_stars");
let five = get_count("five_stars");
let total_bucket = one + two + three + four + five;
let total = obj
.get("total")
.and_then(|v| v.get("count"))
.and_then(|v| v.as_i64())
.unwrap_or(total_bucket);
if total == 0 {
return (None, Some(0));
}
let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
let avg = weighted as f64 / total_bucket.max(1) as f64;
// One decimal place, matching how Trustpilot displays the score.
(Some(format!("{avg:.1}")), Some(total))
}
// ---------------------------------------------------------------------------
// OG / meta-tag fallbacks
// ---------------------------------------------------------------------------
/// Regex out the business name from the standard Trustpilot OG title
/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
fn parse_name_from_og_title(html: &str) -> Option<String> {
let title = og(html, "title")?;
// "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
re.captures(&title)
.and_then(|c| c.get(1))
.map(|m| m.as_str().to_string())
}
/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
/// from the OG title.
fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
let Some(title) = og(html, "title") else {
return (None, None);
};
static RE: OnceLock<Regex> = OnceLock::new();
// "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
let re = RE.get_or_init(|| {
Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
});
let Some(caps) = re.captures(&title) else {
return (None, None);
};
(
caps.get(1).map(|m| m.as_str().trim().to_string()),
caps.get(2).map(|m| m.as_str().to_string()),
)
}
/// Parse "hear what 226 customers have already said" from the OG
/// description tag.
fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
let desc = og(html, "description")?;
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
re.captures(&desc)?
.get(1)?
.as_str()
.replace(',', "")
.parse::<i64>()
.ok()
}
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
let raw = c.get(2).map(|m| m.as_str())?;
return Some(html_unescape(raw));
}
}
None
}
/// Minimal HTML entity unescaping for the four entities the
/// synthesize_html escaper might produce. Keeps us off a heavier dep.
fn html_unescape(s: &str) -> String {
s.replace("&quot;", "\"")
.replace("&amp;", "&")
.replace("&lt;", "<")
.replace("&gt;", ">")
}
fn get_string(v: &Value, key: &str) -> Option<String> {
v.get(key).and_then(|x| x.as_str().map(String::from))
}
// ---------------------------------------------------------------------------
// Review extraction
// ---------------------------------------------------------------------------
fn extract_review(r: &Value) -> Value {
json!({
"id": r.get("id").and_then(|v| v.as_str()),
"rating": r.get("rating").and_then(|v| v.as_i64()),
"title": r.get("title").and_then(|v| v.as_str()),
"text": r.get("text").and_then(|v| v.as_str()),
"language": r.get("language").and_then(|v| v.as_str()),
"source": r.get("source").and_then(|v| v.as_str()),
"likes": r.get("likes").and_then(|v| v.as_i64()),
"author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
"author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
"author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
"verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
"date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
"date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
})
}
// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_trustpilot_review_urls() {
assert!(matches("https://www.trustpilot.com/review/stripe.com"));
assert!(matches("https://trustpilot.com/review/example.com"));
assert!(!matches("https://www.trustpilot.com/"));
assert!(!matches("https://example.com/review/foo"));
}
#[test]
fn parse_review_domain_handles_query_and_slash() {
assert_eq!(
parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
Some("anthropic.com".into())
);
assert_eq!(
parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
Some("anthropic.com".into())
);
assert_eq!(
parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
Some("anthropic.com".into())
);
}
#[test]
fn normalise_star_key_covers_all_buckets() {
assert_eq!(normalise_star_key("1 star"), "one_star");
assert_eq!(normalise_star_key("2 stars"), "two_stars");
assert_eq!(normalise_star_key("5 stars"), "five_stars");
assert_eq!(normalise_star_key("Total"), "total");
}
#[test]
fn compute_rating_stats_weighted_average() {
// 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews.
let dist = json!({
"one_star": { "count": 100, "percent": "50%" },
"two_stars": { "count": 0, "percent": "0%" },
"three_stars": { "count": 0, "percent": "0%" },
"four_stars": { "count": 0, "percent": "0%" },
"five_stars": { "count": 100, "percent": "50%" },
"total": { "count": 200, "percent": "100%" },
});
let (avg, total) = compute_rating_stats(&dist);
assert_eq!(avg.as_deref(), Some("3.0"));
assert_eq!(total, Some(200));
}
#[test]
fn parse_og_title_extracts_name_and_rating() {
let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">"#;
assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
let (label, rating) = parse_rating_from_og_title(html);
assert_eq!(label.as_deref(), Some("Bad"));
assert_eq!(rating.as_deref(), Some("1.5"));
}
#[test]
fn parse_review_count_from_og_description_picks_number() {
let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
assert_eq!(parse_review_count_from_og_description(html), Some(226));
}
#[test]
fn parse_full_fixture_assembles_all_fields() {
let html = r##"<html><head>
<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
]}
</script>
<script type="application/ld+json">
{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
"@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
"@type":"Dataset",
"about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
"name":"Anthropic",
"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
{"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
{"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
{"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
{"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
{"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
{"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
]}}}}
</script>
<script type="application/ld+json">
{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
"aiSummaryReviews":[
{"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
"source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
"dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
</script>
</head></html>"##;
let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
assert_eq!(v["domain"], "anthropic.com");
assert_eq!(v["business_name"], "Anthropic");
assert_eq!(v["rating_label"], "Bad");
assert_eq!(v["review_count"], 226);
assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
assert_eq!(v["rating_distribution"]["total"]["count"], 226);
assert_eq!(v["ai_summary"], "Mixed reviews.");
assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
assert_eq!(v["recent_reviews"][0]["rating"], 1);
assert_eq!(v["recent_reviews"][0]["title"], "Bad");
}
#[test]
fn parse_falls_back_to_og_when_no_jsonld() {
let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
assert_eq!(v["domain"], "anthropic.com");
assert_eq!(v["business_name"], "Anthropic");
assert_eq!(v["average_rating"], "1.5");
assert_eq!(v["review_count"], 226);
assert_eq!(v["rating_label"], "Bad");
}
#[test]
fn parse_returns_ok_with_url_domain_when_nothing_else() {
let v = parse(
"<html><head></head></html>",
"https://www.trustpilot.com/review/example.com",
)
.unwrap();
assert_eq!(v["domain"], "example.com");
assert_eq!(v["business_name"], "example.com");
}
}
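`compute_rating_stats` itself is outside this hunk; the test above pins down its contract as a count-weighted mean over the five star buckets. A standalone sketch of that arithmetic, assuming only what the test shows (`weighted_avg` is an illustrative stand-in, not the project function):

```rust
// Count-weighted mean over star buckets, matching the expectation in
// compute_rating_stats_weighted_average: 100x 1-star + 100x 5-star
// over 200 reviews averages to "3.0".
fn weighted_avg(counts: [i64; 5]) -> (Option<String>, Option<i64>) {
    let total: i64 = counts.iter().sum();
    if total == 0 {
        return (None, None); // no reviews: no average, no total
    }
    let sum: i64 = counts
        .iter()
        .enumerate()
        .map(|(i, c)| (i as i64 + 1) * c) // bucket index 0 = 1 star
        .sum();
    (Some(format!("{:.1}", sum as f64 / total as f64)), Some(total))
}

fn main() {
    let (avg, total) = weighted_avg([100, 0, 0, 0, 100]);
    assert_eq!(avg.as_deref(), Some("3.0"));
    assert_eq!(total, Some(200));
}
```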


@@ -0,0 +1,237 @@
//! WooCommerce product structured extractor.
//!
//! Targets WooCommerce's Store API: `/wp-json/wc/store/v1/products?slug={slug}`.
//! About 30-50% of WooCommerce stores expose this endpoint publicly
//! (it's on by default, but common security plugins disable it).
//! When it's off, the server returns 404 at /wp-json. We surface a
//! clean error and point callers at `/v1/scrape/ecommerce_product`,
//! which works on any store with Schema.org JSON-LD.
//!
//! Explicit-call only. `/product/{slug}` is the default permalink for
//! WooCommerce but custom stores use every variation imaginable, so
//! auto-dispatch is unreliable.
use serde::Deserialize;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "woocommerce_product",
label: "WooCommerce product",
description: "Returns product via the WooCommerce Store REST API (requires the /wp-json/wc/store endpoint to be enabled on the target store).",
url_patterns: &[
"https://{shop}/product/{slug}",
"https://{shop}/shop/{slug}",
],
};
pub fn matches(url: &str) -> bool {
let host = host_of(url);
if host.is_empty() {
return false;
}
// Permissive: WooCommerce stores use custom domains + custom
// permalinks. The extractor's API probe is what confirms it's
// really WooCommerce.
url.contains("/product/")
|| url.contains("/shop/")
|| url.contains("/producto/") // common es locale
|| url.contains("/produit/") // common fr locale
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let slug = parse_slug(url).ok_or_else(|| {
FetchError::Build(format!(
"woocommerce_product: cannot parse slug from '{url}'"
))
})?;
let host = host_of(url);
if host.is_empty() {
return Err(FetchError::Build(format!(
"woocommerce_product: empty host in '{url}'"
)));
}
let scheme = if url.starts_with("http://") {
"http"
} else {
"https"
};
let api_url = format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1");
let resp = client.fetch(&api_url).await?;
if resp.status == 404 {
return Err(FetchError::Build(format!(
"woocommerce_product: {host} does not expose /wp-json/wc/store (404). \
Use /v1/scrape/ecommerce_product for JSON-LD fallback."
)));
}
if resp.status == 401 || resp.status == 403 {
return Err(FetchError::Build(format!(
"woocommerce_product: {host} requires auth for /wp-json/wc/store ({}). \
Use /v1/scrape/ecommerce_product for the public JSON-LD fallback.",
resp.status
)));
}
if resp.status != 200 {
return Err(FetchError::Build(format!(
"woocommerce api returned status {} for {api_url}",
resp.status
)));
}
let products: Vec<Product> = serde_json::from_str(&resp.html)
.map_err(|e| FetchError::BodyDecode(format!("woocommerce parse: {e}")))?;
let p = products.into_iter().next().ok_or_else(|| {
FetchError::Build(format!(
"woocommerce_product: no product found for slug '{slug}' on {host}"
))
})?;
let images: Vec<Value> = p
.images
.iter()
.map(|i| json!({"src": i.src, "thumbnail": i.thumbnail, "alt": i.alt}))
.collect();
let variations_count = p.variations.as_ref().map(|v| v.len()).unwrap_or(0);
Ok(json!({
"url": url,
"api_url": api_url,
"product_id": p.id,
"name": p.name,
"slug": p.slug,
"sku": p.sku,
"permalink": p.permalink,
"on_sale": p.on_sale,
"in_stock": p.is_in_stock,
"is_purchasable": p.is_purchasable,
"price": p.prices.as_ref().and_then(|pr| pr.price.clone()),
"regular_price": p.prices.as_ref().and_then(|pr| pr.regular_price.clone()),
"sale_price": p.prices.as_ref().and_then(|pr| pr.sale_price.clone()),
"currency": p.prices.as_ref().and_then(|pr| pr.currency_code.clone()),
"currency_minor": p.prices.as_ref().and_then(|pr| pr.currency_minor_unit),
"price_range": p.prices.as_ref().and_then(|pr| pr.price_range.clone()),
"average_rating": p.average_rating,
"review_count": p.review_count,
"description": p.description,
"short_description": p.short_description,
"categories": p.categories.iter().filter_map(|c| c.name.clone()).collect::<Vec<_>>(),
"tags": p.tags.iter().filter_map(|t| t.name.clone()).collect::<Vec<_>>(),
"variation_count": variations_count,
"image_count": images.len(),
"images": images,
}))
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn host_of(url: &str) -> &str {
url.split("://")
.nth(1)
.unwrap_or(url)
.split('/')
.next()
.unwrap_or("")
}
/// Extract the product slug from common WooCommerce permalinks.
fn parse_slug(url: &str) -> Option<String> {
for needle in ["/product/", "/shop/", "/producto/", "/produit/"] {
if let Some(after) = url.split(needle).nth(1) {
let stripped = after
.split(['?', '#'])
.next()?
.trim_end_matches('/')
.split('/')
.next()
.unwrap_or("");
if !stripped.is_empty() {
return Some(stripped.to_string());
}
}
}
None
}
// ---------------------------------------------------------------------------
// Store API types (subset of the full response)
// ---------------------------------------------------------------------------
#[derive(Deserialize)]
struct Product {
id: Option<i64>,
name: Option<String>,
slug: Option<String>,
sku: Option<String>,
permalink: Option<String>,
description: Option<String>,
short_description: Option<String>,
on_sale: Option<bool>,
is_in_stock: Option<bool>,
is_purchasable: Option<bool>,
average_rating: Option<serde_json::Value>, // string or number
review_count: Option<i64>,
prices: Option<Prices>,
#[serde(default)]
categories: Vec<Term>,
#[serde(default)]
tags: Vec<Term>,
#[serde(default)]
images: Vec<Img>,
variations: Option<Vec<serde_json::Value>>,
}
#[derive(Deserialize)]
struct Prices {
price: Option<String>,
regular_price: Option<String>,
sale_price: Option<String>,
currency_code: Option<String>,
currency_minor_unit: Option<i64>,
price_range: Option<serde_json::Value>,
}
#[derive(Deserialize)]
struct Term {
name: Option<String>,
}
#[derive(Deserialize)]
struct Img {
src: Option<String>,
thumbnail: Option<String>,
alt: Option<String>,
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_common_permalinks() {
assert!(matches("https://shop.example.com/product/cool-widget"));
assert!(matches("https://shop.example.com/shop/cool-widget"));
assert!(matches("https://tienda.example.com/producto/cosa"));
assert!(matches("https://boutique.example.com/produit/chose"));
}
#[test]
fn parse_slug_handles_locale_and_suffix() {
assert_eq!(
parse_slug("https://shop.example.com/product/cool-widget"),
Some("cool-widget".into())
);
assert_eq!(
parse_slug("https://shop.example.com/product/cool-widget/?attr=red"),
Some("cool-widget".into())
);
assert_eq!(
parse_slug("https://tienda.example.com/producto/cosa/"),
Some("cosa".into())
);
}
}
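The extractor's probe boils down to one URL shape per permalink. A dependency-free sketch of the permalink-to-probe-URL mapping (`probe_url` is illustrative; the real code splits this across `parse_slug` and `host_of`, and also preserves an `http://` scheme, which this sketch hard-codes to https):

```rust
// Map a WooCommerce permalink to the Store API probe URL the extractor
// hits. Mirrors parse_slug/host_of; locale needles match the extractor.
fn probe_url(url: &str) -> Option<String> {
    let host = url.split("://").nth(1).unwrap_or(url).split('/').next()?;
    let slug = ["/product/", "/shop/", "/producto/", "/produit/"]
        .iter()
        .find_map(|n| url.split(*n).nth(1))? // text after the first matching needle
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .filter(|s| !s.is_empty())?
        .to_string();
    Some(format!(
        "https://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1"
    ))
}

fn main() {
    assert_eq!(
        probe_url("https://shop.example.com/product/cool-widget/?attr=red").as_deref(),
        Some("https://shop.example.com/wp-json/wc/store/v1/products?slug=cool-widget&per_page=1")
    );
    assert_eq!(probe_url("https://example.com/blog/post"), None);
}
```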


@@ -0,0 +1,378 @@
//! YouTube video structured extractor.
//!
//! YouTube embeds the full player configuration in a
//! `ytInitialPlayerResponse` JavaScript assignment at the top of
//! every `/watch`, `/shorts`, and `youtu.be` HTML page. We reuse the
//! core crate's already-proven regex + parse to surface typed JSON
//! from it: video id, title, author + channel id, view count,
//! duration, upload date, keywords, thumbnails, caption-track URLs.
//!
//! Auto-dispatched: YouTube host is unique and the `v=` or `/shorts/`
//! shape is stable.
//!
//! ## Fallback
//!
//! `ytInitialPlayerResponse` is missing on EU-consent interstitials,
//! some live-stream pre-show pages, and age-gated videos. In those
//! cases we drop down to OG tags for `title`, `description`,
//! `thumbnail`, and `channel`, and return a `data_source:
//! "og_fallback"` payload so the caller can tell they got a degraded
//! shape (no view count, duration, captions).
use std::sync::OnceLock;
use regex::Regex;
use serde_json::{Value, json};
use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;
pub const INFO: ExtractorInfo = ExtractorInfo {
name: "youtube_video",
label: "YouTube video",
description: "Returns video id, title, channel, view count, duration, upload date, thumbnails, keywords, and caption-track URLs. Falls back to OG metadata on consent / age-gate pages.",
url_patterns: &[
"https://www.youtube.com/watch?v={id}",
"https://youtu.be/{id}",
"https://www.youtube.com/shorts/{id}",
],
};
pub fn matches(url: &str) -> bool {
webclaw_core::youtube::is_youtube_url(url)
|| url.contains("youtube.com/shorts/")
|| url.contains("youtube-nocookie.com/embed/")
}
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
let video_id = parse_video_id(url).ok_or_else(|| {
FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'"))
})?;
// Always fetch the canonical /watch URL. /shorts/ and youtu.be
// sometimes serve a thinner page without the player blob.
let canonical = format!("https://www.youtube.com/watch?v={video_id}");
let resp = client.fetch(&canonical).await?;
if resp.status != 200 {
return Err(FetchError::Build(format!(
"youtube returned status {} for {canonical}",
resp.status
)));
}
if let Some(player) = extract_player_response(&resp.html) {
return Ok(build_player_payload(
&player, &resp.html, url, &canonical, &video_id,
));
}
// No player blob. Fall back to OG tags so the call still returns
// something useful for consent / age-gate pages.
Ok(build_og_fallback(&resp.html, url, &canonical, &video_id))
}
// ---------------------------------------------------------------------------
// Player-blob path (rich payload)
// ---------------------------------------------------------------------------
fn build_player_payload(
player: &Value,
html: &str,
url: &str,
canonical: &str,
video_id: &str,
) -> Value {
let video_details = player.get("videoDetails");
let microformat = player
.get("microformat")
.and_then(|m| m.get("playerMicroformatRenderer"));
let thumbnails: Vec<Value> = video_details
.and_then(|vd| vd.get("thumbnail"))
.and_then(|t| t.get("thumbnails"))
.and_then(|t| t.as_array())
.cloned()
.unwrap_or_default();
let keywords: Vec<Value> = video_details
.and_then(|vd| vd.get("keywords"))
.and_then(|k| k.as_array())
.cloned()
.unwrap_or_default();
let caption_tracks = webclaw_core::youtube::extract_caption_tracks(html);
let captions: Vec<Value> = caption_tracks
.iter()
.map(|c| {
json!({
"url": c.url,
"lang": c.lang,
"name": c.name,
})
})
.collect();
json!({
"url": url,
"canonical_url": canonical,
"data_source": "player_response",
"video_id": video_id,
"title": get_str(video_details, "title"),
"description": get_str(video_details, "shortDescription"),
"author": get_str(video_details, "author"),
"channel_id": get_str(video_details, "channelId"),
"channel_url": get_str(microformat, "ownerProfileUrl"),
"view_count": get_int(video_details, "viewCount"),
"length_seconds": get_int(video_details, "lengthSeconds"),
"is_live": video_details.and_then(|vd| vd.get("isLiveContent")).and_then(|v| v.as_bool()),
"is_private": video_details.and_then(|vd| vd.get("isPrivate")).and_then(|v| v.as_bool()),
"is_unlisted": microformat.and_then(|m| m.get("isUnlisted")).and_then(|v| v.as_bool()),
"allow_ratings": video_details.and_then(|vd| vd.get("allowRatings")).and_then(|v| v.as_bool()),
"category": get_str(microformat, "category"),
"upload_date": get_str(microformat, "uploadDate"),
"publish_date": get_str(microformat, "publishDate"),
"keywords": keywords,
"thumbnails": thumbnails,
"caption_tracks": captions,
})
}
// ---------------------------------------------------------------------------
// OG fallback path (degraded payload)
// ---------------------------------------------------------------------------
fn build_og_fallback(html: &str, url: &str, canonical: &str, video_id: &str) -> Value {
let title = og(html, "title");
let description = og(html, "description");
let thumbnail = og(html, "image");
// YouTube sets `<meta name="channel_name" ...>` on some pages, and
// OG-only pages also carry `og:video:tag` and `<link itemprop="name">`,
// but `<meta name="author">` is the one channel signal that's stable
// across page variants. We keep this lean and read just that.
let channel = meta_name(html, "author");
json!({
"url": url,
"canonical_url":canonical,
"data_source": "og_fallback",
"video_id": video_id,
"title": title,
"description": description,
"author": channel,
// OG path: these are null so the caller doesn't have to guess.
"channel_id": None::<String>,
"channel_url": None::<String>,
"view_count": None::<i64>,
"length_seconds": None::<i64>,
"is_live": None::<bool>,
"is_private": None::<bool>,
"is_unlisted": None::<bool>,
"allow_ratings": None::<bool>,
"category": None::<String>,
"upload_date": None::<String>,
"publish_date": None::<String>,
"keywords": Vec::<Value>::new(),
"thumbnails": thumbnail.as_ref().map(|t| vec![json!({"url": t})]).unwrap_or_default(),
"caption_tracks": Vec::<Value>::new(),
})
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------
fn parse_video_id(url: &str) -> Option<String> {
// youtu.be/{id}
if let Some(after) = url.split("youtu.be/").nth(1) {
let id = after
.split(['?', '#', '/'])
.next()
.unwrap_or("")
.trim_end_matches('/');
if !id.is_empty() {
return Some(id.to_string());
}
}
// youtube.com/shorts/{id}
if let Some(after) = url.split("youtube.com/shorts/").nth(1) {
let id = after
.split(['?', '#', '/'])
.next()
.unwrap_or("")
.trim_end_matches('/');
if !id.is_empty() {
return Some(id.to_string());
}
}
// youtube-nocookie.com/embed/{id}
if let Some(after) = url.split("/embed/").nth(1) {
let id = after
.split(['?', '#', '/'])
.next()
.unwrap_or("")
.trim_end_matches('/');
if !id.is_empty() {
return Some(id.to_string());
}
}
// youtube.com/watch?v={id} (also matches youtube.com/watch?foo=bar&v={id})
if let Some(q) = url.split_once('?').map(|(_, q)| q)
&& let Some(id) = q
.split('&')
.find_map(|p| p.strip_prefix("v=").map(|v| v.to_string()))
{
let id = id.split(['#', '/']).next().unwrap_or(&id).to_string();
if !id.is_empty() {
return Some(id);
}
}
None
}
// ---------------------------------------------------------------------------
// Player-response parsing
// ---------------------------------------------------------------------------
fn extract_player_response(html: &str) -> Option<Value> {
// Same regex as webclaw_core::youtube. Duplicated here because
// core's regex is module-private. Kept in lockstep; changes are
// rare and covered by tests in both places.
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE
.get_or_init(|| Regex::new(r"var\s+ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;").unwrap());
let json_str = re.captures(html)?.get(1)?.as_str();
serde_json::from_str(json_str).ok()
}
// ---------------------------------------------------------------------------
// Meta-tag helpers (for OG fallback)
// ---------------------------------------------------------------------------
fn og(html: &str, prop: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == prop) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
fn meta_name(html: &str, name: &str) -> Option<String> {
static RE: OnceLock<Regex> = OnceLock::new();
let re = RE.get_or_init(|| {
Regex::new(r#"(?i)<meta[^>]+name="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
});
for c in re.captures_iter(html) {
if c.get(1).is_some_and(|m| m.as_str() == name) {
return c.get(2).map(|m| m.as_str().to_string());
}
}
None
}
fn get_str(v: Option<&Value>, key: &str) -> Option<String> {
v.and_then(|x| x.get(key))
.and_then(|x| x.as_str().map(String::from))
}
fn get_int(v: Option<&Value>, key: &str) -> Option<i64> {
v.and_then(|x| x.get(key)).and_then(|x| {
x.as_i64()
.or_else(|| x.as_str().and_then(|s| s.parse::<i64>().ok()))
})
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn matches_watch_urls() {
assert!(matches("https://www.youtube.com/watch?v=dQw4w9WgXcQ"));
assert!(matches("https://youtu.be/dQw4w9WgXcQ"));
assert!(matches("https://www.youtube.com/shorts/abc123"));
assert!(matches(
"https://www.youtube-nocookie.com/embed/dQw4w9WgXcQ"
));
}
#[test]
fn rejects_non_video_urls() {
assert!(!matches("https://www.youtube.com/"));
assert!(!matches("https://www.youtube.com/channel/abc"));
assert!(!matches("https://example.com/watch?v=abc"));
}
#[test]
fn parse_video_id_from_each_shape() {
assert_eq!(
parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"),
Some("dQw4w9WgXcQ".into())
);
assert_eq!(
parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"),
Some("dQw4w9WgXcQ".into())
);
assert_eq!(
parse_video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ"),
Some("dQw4w9WgXcQ".into())
);
assert_eq!(
parse_video_id("https://youtu.be/dQw4w9WgXcQ"),
Some("dQw4w9WgXcQ".into())
);
assert_eq!(
parse_video_id("https://youtu.be/dQw4w9WgXcQ?t=30"),
Some("dQw4w9WgXcQ".into())
);
assert_eq!(
parse_video_id("https://www.youtube.com/shorts/abc123"),
Some("abc123".into())
);
}
#[test]
fn extract_player_response_happy_path() {
let html = r#"
<html><body>
<script>
var ytInitialPlayerResponse = {"videoDetails":{"videoId":"abc","title":"T","author":"A","viewCount":"100","lengthSeconds":"60","shortDescription":"d"}};
</script>
</body></html>
"#;
let v = extract_player_response(html).unwrap();
let vd = v.get("videoDetails").unwrap();
assert_eq!(vd.get("title").unwrap().as_str(), Some("T"));
}
#[test]
fn og_fallback_extracts_basics_from_meta_tags() {
let html = r##"
<html><head>
<meta property="og:title" content="Example Video Title">
<meta property="og:description" content="A cool video description.">
<meta property="og:image" content="https://i.ytimg.com/vi/abc/maxresdefault.jpg">
<meta name="author" content="Example Channel">
</head></html>"##;
let v = build_og_fallback(
html,
"https://www.youtube.com/watch?v=abc",
"https://www.youtube.com/watch?v=abc",
"abc",
);
assert_eq!(v["data_source"], "og_fallback");
assert_eq!(v["title"], "Example Video Title");
assert_eq!(v["description"], "A cool video description.");
assert_eq!(v["author"], "Example Channel");
assert_eq!(
v["thumbnails"][0]["url"],
"https://i.ytimg.com/vi/abc/maxresdefault.jpg"
);
assert!(v["view_count"].is_null());
assert!(v["caption_tracks"].as_array().unwrap().is_empty());
}
}
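The `ytInitialPlayerResponse` grab above is a lazy regex over the raw HTML. The same blob can be located without the regex crate; a rough stdlib-only sketch by brace counting (it ignores braces inside JSON strings, so it is weaker than the real parser; illustration only, `player_blob` is not project code):

```rust
// Pull the ytInitialPlayerResponse JSON blob out of watch-page HTML by
// counting braces. The real extractor uses a lazy regex + serde_json;
// this sketch ignores braces inside JSON strings.
fn player_blob(html: &str) -> Option<&str> {
    let start = html.find("ytInitialPlayerResponse")?;
    let open = start + html[start..].find('{')?;
    let mut depth = 0usize;
    for (i, c) in html[open..].char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    // i is the byte offset of the matching closing brace
                    return Some(&html[open..=open + i]);
                }
            }
            _ => {}
        }
    }
    None
}

fn main() {
    let html = r#"<script>var ytInitialPlayerResponse = {"videoDetails":{"videoId":"abc"}};</script>"#;
    assert_eq!(player_blob(html), Some(r#"{"videoDetails":{"videoId":"abc"}}"#));
    assert_eq!(player_blob("<html>no blob</html>"), None);
}
```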


@@ -0,0 +1,118 @@
//! Pluggable fetcher abstraction for vertical extractors.
//!
//! Extractors call the network through this trait instead of hard-
//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all
//! pass `&FetchClient` (wreq-backed BoringSSL). The production API
//! server, which must not use in-process TLS fingerprinting, provides
//! its own implementation that routes through the Go tls-sidecar.
//!
//! Both paths expose the same [`FetchResult`] shape and the same
//! optional cloud-escalation client, so extractor logic stays
//! identical across environments.
//!
//! ## Choosing an implementation
//!
//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`]
//! with [`FetchClient::with_cloud`] to attach cloud fallback, pass
//! it to extractors as `&client`.
//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher`
//! (in `server/src/engine/`) that delegates to `engine::tls_client`
//! and wraps it in `Arc<dyn Fetcher>` for handler injection.
//!
//! ## Why a trait and not a free function
//!
//! Extractors need state beyond a single fetch: the cloud client for
//! antibot escalation, and in the future per-user proxy pools, tenant
//! headers, circuit breakers. A trait keeps that state encapsulated
//! behind the fetch interface instead of threading it through every
//! extractor signature.
use async_trait::async_trait;
use crate::client::FetchResult;
use crate::cloud::CloudClient;
use crate::error::FetchError;
/// HTTP fetch surface used by vertical extractors.
///
/// Implementations must be `Send + Sync` because extractor dispatchers
/// run them inside tokio tasks, potentially across many requests.
#[async_trait]
pub trait Fetcher: Send + Sync {
/// Fetch a URL and return the raw response body + metadata. The
/// body is in `FetchResult::html` regardless of the actual content
/// type — JSON API endpoints put JSON there, HTML pages put HTML.
/// Extractors branch on response status and body shape.
async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>;
/// Fetch with additional request headers. Needed for endpoints
/// that authenticate via a specific header (Instagram's
/// `x-ig-app-id`, for example). The default implementation routes
/// to [`Self::fetch`] so implementers without header support stay
/// functional, though the extra headers are silently dropped from
/// the request.
async fn fetch_with_headers(
&self,
url: &str,
_headers: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
self.fetch(url).await
}
/// Optional cloud-escalation client for antibot bypass. Returning
/// `Some` tells extractors they can call into the hosted API when
/// local fetch hits a challenge page. Returning `None` makes
/// cloud-gated extractors emit [`CloudError::NotConfigured`] with
/// an actionable signup link.
///
/// The default implementation returns `None` because not every
/// deployment wants cloud fallback (self-hosts that don't have a
/// webclaw.io subscription, for instance).
///
/// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured
fn cloud(&self) -> Option<&CloudClient> {
None
}
}
// ---------------------------------------------------------------------------
// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
// ---------------------------------------------------------------------------
#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for &T {
async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
(**self).fetch(url).await
}
async fn fetch_with_headers(
&self,
url: &str,
headers: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
(**self).fetch_with_headers(url, headers).await
}
fn cloud(&self) -> Option<&CloudClient> {
(**self).cloud()
}
}
#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> {
async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
(**self).fetch(url).await
}
async fn fetch_with_headers(
&self,
url: &str,
headers: &[(&str, &str)],
) -> Result<FetchResult, FetchError> {
(**self).fetch_with_headers(url, headers).await
}
fn cloud(&self) -> Option<&CloudClient> {
(**self).cloud()
}
}
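The two blanket impls above are a standard delegation pattern: forward every method through the reference so `&T` and `Arc<T>` satisfy the trait whenever `T` does. A minimal dependency-free sketch of the same shape (synchronous, with an illustrative `Fetch` trait and `Stub` type standing in for the real async `Fetcher`):

```rust
use std::sync::Arc;

// Illustrative stand-in for the real Fetcher trait (sync, String body).
trait Fetch: Send + Sync {
    fn fetch(&self, url: &str) -> String;
}

struct Stub;

impl Fetch for Stub {
    fn fetch(&self, url: &str) -> String {
        format!("fetched {url}")
    }
}

// Blanket impls: delegate through the reference, as in the module above.
impl<T: Fetch + ?Sized> Fetch for &T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

impl<T: Fetch + ?Sized> Fetch for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

// Generic call sites now accept Stub, &Stub, or Arc<dyn Fetch> alike.
fn run(f: impl Fetch) -> String {
    f.fetch("https://example.com")
}

fn main() {
    let shared: Arc<dyn Fetch> = Arc::new(Stub);
    assert_eq!(run(shared), "fetched https://example.com");
    assert_eq!(run(&Stub), "fetched https://example.com");
}
```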


@@ -3,10 +3,14 @@
//! Automatically detects PDF responses and delegates to webclaw-pdf.
pub mod browser;
pub mod client;
pub mod cloud;
pub mod crawler;
pub mod document;
pub mod error;
pub mod extractors;
pub mod fetcher;
pub mod linkedin;
pub mod locale;
pub mod proxy;
pub mod reddit;
pub mod sitemap;
@@ -16,7 +20,9 @@ pub use browser::BrowserProfile;
pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
pub use error::FetchError;
pub use fetcher::Fetcher;
pub use http::HeaderMap;
pub use locale::{accept_language_for_tld, accept_language_for_url};
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;


@@ -0,0 +1,77 @@
//! Derive an `Accept-Language` header from a URL.
//!
//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
//! leboncoin.fr) does a geo-vs-locale sanity check: residential IP in the
//! target country + a browser UA but the wrong `Accept-Language` is a bot
//! signal. Matching the site's expected locale gets us through.
//!
//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.
/// Best-effort `Accept-Language` header value for the given URL's TLD.
/// Returns `None` if the URL cannot be parsed.
pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
let tld = host.rsplit('.').next()?;
Some(accept_language_for_tld(tld))
}
/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
/// Unknown TLDs fall back to US English.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
match tld {
"it" => "it-IT,it;q=0.9",
"fr" => "fr-FR,fr;q=0.9",
"de" | "at" => "de-DE,de;q=0.9",
"es" => "es-ES,es;q=0.9",
"pt" => "pt-PT,pt;q=0.9",
"nl" => "nl-NL,nl;q=0.9",
"pl" => "pl-PL,pl;q=0.9",
"se" => "sv-SE,sv;q=0.9",
"no" => "nb-NO,nb;q=0.9",
"dk" => "da-DK,da;q=0.9",
"fi" => "fi-FI,fi;q=0.9",
"cz" => "cs-CZ,cs;q=0.9",
"ro" => "ro-RO,ro;q=0.9",
"gr" => "el-GR,el;q=0.9",
"tr" => "tr-TR,tr;q=0.9",
"ru" => "ru-RU,ru;q=0.9",
"jp" => "ja-JP,ja;q=0.9",
"kr" => "ko-KR,ko;q=0.9",
"cn" => "zh-CN,zh;q=0.9",
"tw" | "hk" => "zh-TW,zh;q=0.9",
"br" => "pt-BR,pt;q=0.9",
"mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
"uk" | "ie" => "en-GB,en;q=0.9",
_ => "en-US,en;q=0.9",
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn tld_dispatch() {
assert_eq!(
accept_language_for_url("https://www.immobiliare.it/annunci/1"),
Some("it-IT,it;q=0.9")
);
assert_eq!(
accept_language_for_url("https://www.leboncoin.fr/"),
Some("fr-FR,fr;q=0.9")
);
assert_eq!(
accept_language_for_url("https://www.amazon.co.uk/"),
Some("en-GB,en;q=0.9")
);
assert_eq!(
accept_language_for_url("https://example.com/"),
Some("en-US,en;q=0.9")
);
}
#[test]
fn bad_url_returns_none() {
assert_eq!(accept_language_for_url("not-a-url"), None);
}
}
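The module keys everything off the last host label, which is why `amazon.co.uk` maps through `uk`. A stdlib-only sketch of the same idea with a naive host parse (`last_label` and `accept_language` are illustrative names; the real module delegates host parsing to the `url` crate and maps many more TLDs):

```rust
// Naive host parse + last-label lookup; the real module uses the `url`
// crate for robust host parsing and a fuller TLD table.
fn last_label(url: &str) -> Option<&str> {
    let host = url.split("://").nth(1)?.split(['/', '?', '#']).next()?;
    let host = host.split('@').next_back()?.split(':').next()?; // drop userinfo/port
    let label = host.rsplit('.').next()?;
    (!label.is_empty() && label.chars().all(|c| c.is_ascii_alphabetic())).then_some(label)
}

fn accept_language(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        "uk" | "ie" => "en-GB,en;q=0.9",
        _ => "en-US,en;q=0.9", // global fallback
    }
}

fn main() {
    let lang = last_label("https://www.amazon.co.uk/dp/X").map(accept_language);
    assert_eq!(lang, Some("en-GB,en;q=0.9"));
    assert_eq!(last_label("not-a-url"), None);
}
```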


@@ -7,10 +7,15 @@
use std::time::Duration;
use std::borrow::Cow;
use wreq::http2::{
Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
};
use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
use wreq::tls::{
AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
TlsVersion,
};
use wreq::{Client, Emulation};
use crate::browser::BrowserVariant;
@@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
/// Safari curves.
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";
/// Safari iOS 26 TLS extension order, matching bogdanfinn's
/// `safari_ios_26_0` wire format. GREASE slots are omitted; wreq
/// inserts them itself. This diverges from wreq-util's default
/// SafariIos26 extension order, which DataDome's immobiliare.it
/// ruleset flags.
fn safari_ios_extensions() -> Vec<ExtensionType> {
vec![
ExtensionType::CERTIFICATE_TIMESTAMP,
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
ExtensionType::SERVER_NAME,
ExtensionType::CERT_COMPRESSION,
ExtensionType::KEY_SHARE,
ExtensionType::SUPPORTED_VERSIONS,
ExtensionType::PSK_KEY_EXCHANGE_MODES,
ExtensionType::SUPPORTED_GROUPS,
ExtensionType::RENEGOTIATE,
ExtensionType::SIGNATURE_ALGORITHMS,
ExtensionType::STATUS_REQUEST,
ExtensionType::EC_POINT_FORMATS,
ExtensionType::EXTENDED_MASTER_SECRET,
]
}
/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
/// per handshake, but indeed.com's WAF allowlists this specific wire order
/// and rejects permuted ones. GREASE slots are inserted by wreq.
///
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
fn chrome_extensions() -> Vec<ExtensionType> {
vec![
ExtensionType::CERTIFICATE_TIMESTAMP, // 18
ExtensionType::STATUS_REQUEST, // 5
ExtensionType::SESSION_TICKET, // 35
ExtensionType::KEY_SHARE, // 51
ExtensionType::SUPPORTED_GROUPS, // 10
ExtensionType::PSK_KEY_EXCHANGE_MODES, // 45
ExtensionType::EC_POINT_FORMATS, // 11
ExtensionType::CERT_COMPRESSION, // 27
ExtensionType::APPLICATION_SETTINGS_NEW, // 17613 (new codepoint, matches alps_use_new_codepoint)
ExtensionType::SUPPORTED_VERSIONS, // 43
ExtensionType::SIGNATURE_ALGORITHMS, // 13
ExtensionType::SERVER_NAME, // 0
ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
ExtensionType::ENCRYPTED_CLIENT_HELLO, // 65037
ExtensionType::RENEGOTIATE, // 65281
ExtensionType::EXTENDED_MASTER_SECRET, // 23
]
}
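The JA3 extension field quoted in the doc comment is just the dash-joined decimal codepoints of that list, GREASE excluded. A quick self-check, using the codepoints from the inline comments above:

```rust
/// Codepoints from the chrome_extensions() comments, in wire order.
const CHROME_EXT_CODEPOINTS: [u16; 16] = [
    18, 5, 35, 51, 10, 45, 11, 27, 17613, 43, 13, 0, 16, 65037, 65281, 23,
];

/// Render the codepoints the way JA3 renders its extension field.
fn ja3_extension_field() -> String {
    CHROME_EXT_CODEPOINTS
        .iter()
        .map(|c| c.to_string())
        .collect::<Vec<_>>()
        .join("-")
}
```

This reproduces the `18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23` field from peet.ws; the full JA3 hash additionally covers TLS version, ciphers, curves, and point formats.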
// --- Chrome HTTP headers in correct wire order ---
const CHROME_HEADERS: &[(&str, &str)] = &[
@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
("sec-fetch-dest", "document"),
];
/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
/// expects for a real iPhone.
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
(
"accept",
"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
),
("accept-language", "en-US,en;q=0.9"),
("accept-encoding", "gzip, deflate, br"),
(
"user-agent",
"Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
),
("upgrade-insecure-requests", "1"),
];
const EDGE_HEADERS: &[(&str, &str)] = &[
(
"sec-ch-ua",
@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
];
fn chrome_tls() -> TlsOptions {
// permute_extensions is off so the explicit extension_permutation sticks.
// Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
// fixed order, so matching that gets us through.
TlsOptions::builder()
.cipher_list(CHROME_CIPHERS)
.sigalgs_list(CHROME_SIGALGS)
@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
.min_tls_version(TlsVersion::TLS_1_2)
.max_tls_version(TlsVersion::TLS_1_3)
.grease_enabled(true)
.permute_extensions(true)
.permute_extensions(false)
.extension_permutation(chrome_extensions())
.enable_ech_grease(true)
.pre_shared_key(true)
.enable_ocsp_stapling(true)
.enable_signed_cert_timestamps(true)
.alps_protocols([AlpsProtocol::HTTP2])
.alpn_protocols([
AlpnProtocol::HTTP3,
AlpnProtocol::HTTP2,
AlpnProtocol::HTTP1,
])
.alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
.alps_use_new_codepoint(true)
.aes_hw_override(true)
.certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
.build()
}
/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
/// because the wire-level defaults from wreq-util are already correct for ciphers,
/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
/// DataDome compatibility are overridden here:
///
/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (JA3
/// ends up `8d909525bd5bbb79f133d11cc05159fe`).
/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
/// wreq-util omits this frame; real Safari and bogdanfinn include it.
/// This flip is the thing DataDome actually reads — the akamai_fingerprint
/// hash changes from `c52879e43202aeb92740be6e8c86ea96` to
/// `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
/// `priority: u=0, i`, zstd), replace with the real iOS 26 set.
/// 4. `accept-language` preserved from config.extra_headers for locale.
fn safari_ios_emulation() -> wreq::Emulation {
use wreq::EmulationFactory;
let mut em = wreq_util::Emulation::SafariIos26.emulation();
if let Some(tls) = em.tls_options_mut().as_mut() {
tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
}
// Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
// and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
// to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
if let Some(h2) = em.http2_options_mut().as_mut() {
h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
}
let hm = em.headers_mut();
hm.clear();
for (k, v) in SAFARI_IOS_HEADERS {
if let (Ok(n), Ok(val)) = (
http::header::HeaderName::from_bytes(k.as_bytes()),
http::header::HeaderValue::from_str(v),
) {
hm.append(n, val);
}
}
em
}
fn chrome_h2() -> Http2Options {
// SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
// ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
// MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
// and indeed.com's WAF reads this as a bot signal when present. Priority
// weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame.
Http2Options::builder()
.initial_window_size(6_291_456)
.initial_connection_window_size(15_728_640)
.max_header_list_size(262_144)
.header_table_size(65_536)
.max_concurrent_streams(1000u32)
.enable_push(false)
.settings_order(
SettingsOrder::builder()
.extend([
SettingId::HeaderTableSize,
SettingId::EnablePush,
SettingId::MaxConcurrentStreams,
SettingId::InitialWindowSize,
SettingId::MaxFrameSize,
SettingId::MaxHeaderListSize,
SettingId::EnableConnectProtocol,
SettingId::NoRfc7540Priorities,
])
.build(),
)
@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
])
.build(),
)
.headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
.headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
.build()
}
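The "weight 256 (encoded as 255 + 1)" note follows RFC 7540 §5.3.2: the priority weight field stores weight minus one in a single byte, so effective weights run 1..=256. A hypothetical one-line helper, just to make the arithmetic explicit:

```rust
/// RFC 7540 §5.3.2: the on-wire weight byte stores (weight - 1),
/// so a stored 255 means an effective priority weight of 256.
fn effective_h2_weight(stored: u8) -> u16 {
    stored as u16 + 1
}
```

Under this encoding the old `StreamDependency` value of 219 meant an effective weight of 220, and the new 255 means 256, matching bogdanfinn's HEADERS frame.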
@ -328,32 +456,38 @@ pub fn build_client(
extra_headers: &std::collections::HashMap<String, String>,
proxy: Option<&str>,
) -> Result<Client, FetchError> {
let (tls, h2, headers) = match variant {
BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
// SafariIos26 builds its Emulation on top of wreq-util's base instead
// of from scratch. See `safari_ios_emulation` for why.
let mut emulation = match variant {
BrowserVariant::SafariIos26 => safari_ios_emulation(),
other => {
let (tls, h2, headers) = match other {
BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
BrowserVariant::SafariIos26 => unreachable!("handled above"),
};
Emulation::builder()
.tls_options(tls)
.http2_options(h2)
.headers(build_headers(headers))
.build()
}
};
let mut header_map = build_headers(headers);
// Append extra headers after profile defaults
// Append extra headers after profile defaults.
let hm = emulation.headers_mut();
for (k, v) in extra_headers {
if let (Ok(n), Ok(val)) = (
http::header::HeaderName::from_bytes(k.as_bytes()),
http::header::HeaderValue::from_str(v),
) {
header_map.insert(n, val);
hm.insert(n, val);
}
}
let emulation = Emulation::builder()
.tls_options(tls)
.http2_options(h2)
.headers(header_map)
.build();
let mut builder = Client::builder()
.emulation(emulation)
.redirect(wreq::redirect::Policy::limited(10))

View file

@ -22,6 +22,5 @@ serde_json = { workspace = true }
tokio = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true }
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
url = "2"
dirs = "6.0.0"

View file

@ -1,302 +0,0 @@
/// Cloud API fallback for protected sites.
///
/// When local fetch returns a challenge page, this module retries
/// via api.webclaw.io. Requires WEBCLAW_API_KEY to be set.
use std::time::Duration;
use serde_json::{Value, json};
use tracing::info;
const API_BASE: &str = "https://api.webclaw.io/v1";
/// Lightweight client for the webclaw cloud API.
pub struct CloudClient {
api_key: String,
http: reqwest::Client,
}
impl CloudClient {
/// Create a new cloud client from WEBCLAW_API_KEY env var.
/// Returns None if the key is not set.
pub fn from_env() -> Option<Self> {
let key = std::env::var("WEBCLAW_API_KEY").ok()?;
if key.is_empty() {
return None;
}
let http = reqwest::Client::builder()
.timeout(Duration::from_secs(60))
.build()
.unwrap_or_default();
Some(Self { api_key: key, http })
}
/// Scrape a URL via the cloud API. Returns the response JSON.
pub async fn scrape(
&self,
url: &str,
formats: &[&str],
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
) -> Result<Value, String> {
let mut body = json!({
"url": url,
"formats": formats,
});
if only_main_content {
body["only_main_content"] = json!(true);
}
if !include_selectors.is_empty() {
body["include_selectors"] = json!(include_selectors);
}
if !exclude_selectors.is_empty() {
body["exclude_selectors"] = json!(exclude_selectors);
}
self.post("scrape", body).await
}
/// Generic POST to the cloud API.
pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
let resp = self
.http
.post(format!("{API_BASE}/{endpoint}"))
.header("Authorization", format!("Bearer {}", self.api_key))
.json(&body)
.send()
.await
.map_err(|e| format!("Cloud API request failed: {e}"))?;
let status = resp.status();
if !status.is_success() {
let text = resp.text().await.unwrap_or_default();
let truncated = truncate_error(&text);
return Err(format!("Cloud API error {status}: {truncated}"));
}
resp.json::<Value>()
.await
.map_err(|e| format!("Cloud API response parse failed: {e}"))
}
/// Generic GET from the cloud API.
pub async fn get(&self, endpoint: &str) -> Result<Value, String> {
let resp = self
.http
.get(format!("{API_BASE}/{endpoint}"))
.header("Authorization", format!("Bearer {}", self.api_key))
.send()
.await
.map_err(|e| format!("Cloud API request failed: {e}"))?;
let status = resp.status();
if !status.is_success() {
let text = resp.text().await.unwrap_or_default();
let truncated = truncate_error(&text);
return Err(format!("Cloud API error {status}: {truncated}"));
}
resp.json::<Value>()
.await
.map_err(|e| format!("Cloud API response parse failed: {e}"))
}
}
/// Truncate error body to avoid flooding logs with huge HTML responses.
fn truncate_error(text: &str) -> &str {
const MAX_LEN: usize = 500;
match text.char_indices().nth(MAX_LEN) {
Some((byte_pos, _)) => &text[..byte_pos],
None => text,
}
}
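`truncate_error` cuts on a char boundary via `char_indices`; a plain byte slice like `&text[..500]` would panic mid-codepoint on multibyte responses. The same pattern, generalized over the limit so it is easy to exercise:

```rust
/// Same approach as truncate_error above, with the limit as a
/// parameter: nth(max_chars) yields the byte offset where the
/// (max_chars + 1)-th char starts, which is always a safe cut point.
/// If the string is shorter than the limit, it is returned whole.
fn truncate_at_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}
```

For example, truncating `"héllo"` to 2 chars yields `"hé"` (3 bytes), where a naive `&text[..2]` would split the two-byte `é` and panic.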
/// Check whether fetched HTML looks like a common bot-protection challenge page.
pub fn is_bot_protected(html: &str, headers: &webclaw_fetch::HeaderMap) -> bool {
let html_lower = html.to_lowercase();
// Cloudflare challenge page
if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
return true;
}
// Cloudflare "checking your browser" spinner
if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
&& html_lower.contains("cf-spinner")
{
return true;
}
// Cloudflare Turnstile (only on short pages = challenge, not embedded on real content)
if (html_lower.contains("cf-turnstile")
|| html_lower.contains("challenges.cloudflare.com/turnstile"))
&& html.len() < 100_000
{
return true;
}
// DataDome
if html_lower.contains("geo.captcha-delivery.com")
|| html_lower.contains("captcha-delivery.com/captcha")
{
return true;
}
// AWS WAF
if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
return true;
}
// hCaptcha blocking page
if html_lower.contains("hcaptcha.com")
&& html_lower.contains("h-captcha")
&& html.len() < 50_000
{
return true;
}
// Cloudflare via headers + challenge body
let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
if has_cf_headers
&& (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
{
return true;
}
false
}
/// Check if a page likely needs JS rendering (SPA with almost no text content).
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
let has_scripts = html.contains("<script");
// Tier 1: almost no extractable text from a large page
if word_count < 50 && html.len() > 5_000 && has_scripts {
return true;
}
// Tier 2: SPA framework detected with suspiciously low content-to-HTML ratio
if word_count < 800 && html.len() > 50_000 && has_scripts {
let html_lower = html.to_lowercase();
let has_spa_marker = html_lower.contains("react-app")
|| html_lower.contains("id=\"__next\"")
|| html_lower.contains("id=\"root\"")
|| html_lower.contains("id=\"app\"")
|| html_lower.contains("__next_data__")
|| html_lower.contains("nuxt")
|| html_lower.contains("ng-app");
if has_spa_marker {
return true;
}
}
false
}
/// Result of a smart fetch: either local extraction or cloud API response.
pub enum SmartFetchResult {
/// Successfully extracted locally.
Local(Box<webclaw_core::ExtractionResult>),
/// Fell back to cloud API. Contains the API response JSON.
Cloud(Value),
}
/// Try local fetch first, fall back to cloud API if bot-protected or JS-rendered.
///
/// Returns the extraction result (local) or the cloud API response JSON.
/// If no API key is configured and local fetch is blocked, returns an error
/// with a helpful message.
pub async fn smart_fetch(
client: &webclaw_fetch::FetchClient,
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
formats: &[&str],
) -> Result<SmartFetchResult, String> {
// Step 1: Try local fetch (with timeout to avoid hanging on slow servers)
let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
.await
.map_err(|_| format!("Fetch timed out after 30s for {url}"))?
.map_err(|e| format!("Fetch failed: {e}"))?;
// Step 2: Check for bot protection
if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
info!(url, "bot protection detected, falling back to cloud API");
return cloud_fallback(
cloud,
url,
include_selectors,
exclude_selectors,
only_main_content,
formats,
)
.await;
}
// Step 3: Extract locally
let options = webclaw_core::ExtractionOptions {
include_selectors: include_selectors.to_vec(),
exclude_selectors: exclude_selectors.to_vec(),
only_main_content,
include_raw_html: false,
};
let extraction =
webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
.map_err(|e| format!("Extraction failed: {e}"))?;
// Step 4: Check for JS-rendered pages (low content from large HTML)
if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
info!(
url,
word_count = extraction.metadata.word_count,
html_len = fetch_result.html.len(),
"JS-rendered page detected, falling back to cloud API"
);
return cloud_fallback(
cloud,
url,
include_selectors,
exclude_selectors,
only_main_content,
formats,
)
.await;
}
Ok(SmartFetchResult::Local(Box::new(extraction)))
}
async fn cloud_fallback(
cloud: Option<&CloudClient>,
url: &str,
include_selectors: &[String],
exclude_selectors: &[String],
only_main_content: bool,
formats: &[&str],
) -> Result<SmartFetchResult, String> {
match cloud {
Some(c) => {
let resp = c
.scrape(
url,
formats,
include_selectors,
exclude_selectors,
only_main_content,
)
.await?;
info!(url, "cloud API fallback successful");
Ok(SmartFetchResult::Cloud(resp))
}
None => Err(format!(
"Bot protection detected on {url}. Set WEBCLAW_API_KEY for automatic cloud bypass. \
Get a key at https://webclaw.io"
)),
}
}

View file

@ -1,7 +1,6 @@
/// webclaw-mcp: MCP (Model Context Protocol) server for webclaw.
/// Exposes web extraction tools over stdio transport for AI agents
/// like Claude Desktop, Claude Code, and other MCP clients.
mod cloud;
mod server;
mod tools;

View file

@ -15,7 +15,8 @@ use serde_json::json;
use tracing::{error, info, warn};
use url::Url;
use crate::cloud::{self, CloudClient, SmartFetchResult};
use webclaw_fetch::cloud::{self, CloudClient, SmartFetchResult};
use crate::tools::*;
pub struct WebclawMcp {
@ -717,6 +718,55 @@ impl WebclawMcp {
Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
}
}
/// List every vertical extractor the server knows about. Returns a
/// JSON array of `{name, label, description, url_patterns}` entries.
/// Call this to discover what verticals are available before using
/// `vertical_scrape`.
#[tool]
async fn list_extractors(
&self,
Parameters(_params): Parameters<ListExtractorsParams>,
) -> Result<String, String> {
let catalog = webclaw_fetch::extractors::list();
serde_json::to_string_pretty(&catalog)
.map_err(|e| format!("failed to serialise extractor catalog: {e}"))
}
/// Run a vertical extractor by name and return typed JSON specific
/// to the target site (title, price, rating, author, etc.), not
/// generic markdown. Use `list_extractors` to discover available
/// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
/// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
///
/// Antibot-gated verticals (amazon_product, ebay_listing,
/// etsy_listing, trustpilot_reviews) will automatically escalate to
/// the webclaw cloud API when local fetch hits bot protection,
/// provided `WEBCLAW_API_KEY` is set.
#[tool]
async fn vertical_scrape(
&self,
Parameters(params): Parameters<VerticalParams>,
) -> Result<String, String> {
validate_url(&params.url)?;
// Use the cached Firefox client, not the default Chrome one.
// Reddit's `.json` endpoint rejects the wreq-Chrome TLS
// fingerprint with a 403 even from residential IPs (they
// ship a fingerprint blocklist that includes common
// browser-emulation libraries). The wreq-Firefox fingerprint
// still passes, and Firefox is equally fine for every other
// vertical in the catalog, so it's a strictly safer default
// for `vertical_scrape` than the generic `scrape` tool's
// Chrome default. Matches the CLI `webclaw vertical`
// subcommand which already uses Firefox.
let client = self.firefox_or_build()?;
let data =
webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
.await
.map_err(|e| e.to_string())?;
serde_json::to_string_pretty(&data)
.map_err(|e| format!("failed to serialise extractor output: {e}"))
}
}
#[tool_handler]
@ -726,7 +776,8 @@ impl ServerHandler for WebclawMcp {
.with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
.with_instructions(String::from(
"Webclaw MCP server -- web content extraction for AI agents. \
Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
list_extractors, vertical_scrape.",
))
}
}

View file

@ -103,3 +103,20 @@ pub struct SearchParams {
/// Number of results to return (default: 10)
pub num_results: Option<u32>,
}
/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct VerticalParams {
/// Name of the vertical extractor. Call `list_extractors` to see all
/// available names. Examples: "reddit", "github_repo", "pypi",
/// "trustpilot_reviews", "youtube_video", "shopify_product".
pub name: String,
/// URL to extract. Must match the URL patterns the extractor claims;
/// otherwise the tool returns a clear "URL mismatch" error.
pub url: String,
}
/// `list_extractors` takes no arguments but we still need an empty struct
/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
#[derive(Debug, Deserialize, JsonSchema)]
pub struct ListExtractorsParams {}

View file

@ -79,10 +79,15 @@ async fn main() -> anyhow::Result<()> {
let v1 = Router::new()
.route("/scrape", post(routes::scrape::scrape))
.route(
"/scrape/{vertical}",
post(routes::structured::scrape_vertical),
)
.route("/crawl", post(routes::crawl::crawl))
.route("/map", post(routes::map::map))
.route("/batch", post(routes::batch::batch))
.route("/extract", post(routes::extract::extract))
.route("/extractors", get(routes::structured::list_extractors))
.route("/summarize", post(routes::summarize::summarize_route))
.route("/diff", post(routes::diff::diff_route))
.route("/brand", post(routes::brand::brand))

View file

@ -15,4 +15,5 @@ pub mod extract;
pub mod health;
pub mod map;
pub mod scrape;
pub mod structured;
pub mod summarize;

View file

@ -0,0 +1,55 @@
//! `POST /v1/scrape/{vertical}` and `GET /v1/extractors`.
//!
//! Vertical extractors return typed JSON instead of generic markdown.
//! See `webclaw_fetch::extractors` for the catalog and per-site logic.
use axum::{
Json,
extract::{Path, State},
};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_fetch::extractors::{self, ExtractorDispatchError};
use crate::{error::ApiError, state::AppState};
#[derive(Debug, Deserialize)]
pub struct ScrapeRequest {
pub url: String,
}
/// Map dispatcher errors to ApiError so users get clean HTTP statuses
/// instead of opaque 500s.
impl From<ExtractorDispatchError> for ApiError {
fn from(e: ExtractorDispatchError) -> Self {
match e {
ExtractorDispatchError::UnknownVertical(_) => ApiError::NotFound,
ExtractorDispatchError::UrlMismatch { .. } => ApiError::bad_request(e.to_string()),
ExtractorDispatchError::Fetch(f) => ApiError::Fetch(f.to_string()),
}
}
}
/// `GET /v1/extractors` — catalog of all available verticals.
pub async fn list_extractors() -> Json<Value> {
Json(json!({
"extractors": extractors::list(),
}))
}
/// `POST /v1/scrape/{vertical}` — explicit vertical, e.g. /v1/scrape/reddit.
pub async fn scrape_vertical(
State(state): State<AppState>,
Path(vertical): Path<String>,
Json(req): Json<ScrapeRequest>,
) -> Result<Json<Value>, ApiError> {
if req.url.trim().is_empty() {
return Err(ApiError::bad_request("`url` is required"));
}
let data = extractors::dispatch_by_name(state.fetch(), &vertical, &req.url).await?;
Ok(Json(json!({
"vertical": vertical,
"url": req.url,
"data": data,
})))
}

View file

@ -1,7 +1,24 @@
//! Shared application state. Cheap to clone via Arc; held by the axum
//! Router for the life of the process.
//!
//! Two unrelated keys get carried here:
//!
//! 1. [`AppState::api_key`] — the **bearer token clients must present**
//! to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
//! Unset = open mode.
//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
//! **outbound** credential for api.webclaw.io, used by extractors
//! that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
//! Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
//! error with a signup link.
//!
//! Different variables on purpose: conflating the two means operators
//! who want their server behind an auth token can't also enable cloud
//! fallback, and vice versa.
use std::sync::Arc;
use tracing::info;
use webclaw_fetch::cloud::CloudClient;
use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};
/// Single-process state shared across all request handlers.
@ -17,6 +34,7 @@ struct Inner {
/// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
/// them nothing.
pub fetch: Arc<FetchClient>,
/// Inbound bearer-auth token for this server's own `/v1/*` surface.
pub api_key: Option<String>,
}
@ -24,17 +42,34 @@ impl AppState {
/// Build the application state. The fetch client is constructed once
/// and shared across requests so connection pools + browser profile
/// state don't churn per request.
pub fn new(api_key: Option<String>) -> anyhow::Result<Self> {
///
/// `inbound_api_key` is the bearer token clients must present;
/// cloud-fallback credentials come from the env (checked here).
pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
let config = FetchConfig {
browser: BrowserProfile::Chrome,
browser: BrowserProfile::Firefox,
..FetchConfig::default()
};
let fetch = FetchClient::new(config)
let mut fetch = FetchClient::new(config)
.map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;
// Cloud fallback: only activates when the operator has provided
// an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
// (preferred, disambiguates from the inbound-auth key) and
// WEBCLAW_API_KEY as a fallback when there's no inbound key
// configured (backwards compat with MCP / CLI conventions).
if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
info!(
base = cloud.base_url(),
"cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
);
fetch = fetch.with_cloud(cloud);
}
Ok(Self {
inner: Arc::new(Inner {
fetch: Arc::new(fetch),
api_key,
api_key: inbound_api_key,
}),
})
}
@ -47,3 +82,26 @@ impl AppState {
self.inner.api_key.as_deref()
}
}
/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
/// configured (i.e. open mode — the same env var can't mean two
/// things to one process).
fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
if let Some(k) = cloud_key.as_deref()
&& !k.trim().is_empty()
{
return Some(CloudClient::with_key(k));
}
// Reuse WEBCLAW_API_KEY only when not also acting as our own
// inbound-auth token — otherwise we'd be telling the operator
// they can't have both.
if inbound_api_key.is_none()
&& let Ok(k) = std::env::var("WEBCLAW_API_KEY")
&& !k.trim().is_empty()
{
return Some(CloudClient::with_key(k));
}
None
}
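The precedence in `build_cloud_client` reduces to a pure function over three optional keys, which can be checked without touching the environment. A hypothetical, env-free restatement:

```rust
/// Env-free restatement of build_cloud_client's precedence:
/// the dedicated cloud key always wins; the shared key is reused
/// only in open mode (no inbound auth key configured). Blank keys
/// are treated as absent, mirroring the trim() checks above.
fn resolve_cloud_key<'a>(
    cloud_key: Option<&'a str>,
    shared_key: Option<&'a str>,
    inbound_key: Option<&'a str>,
) -> Option<&'a str> {
    if let Some(k) = cloud_key {
        if !k.trim().is_empty() {
            return Some(k);
        }
    }
    if inbound_key.is_none() {
        if let Some(k) = shared_key {
            if !k.trim().is_empty() {
                return Some(k);
            }
        }
    }
    None
}
```

The interesting case is the second one: a configured inbound key suppresses reuse of the shared `WEBCLAW_API_KEY`, exactly as the doc comment argues, so one env var never means two things to one process.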