mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-04-25 00:06:21 +02:00
Two ecommerce extractors covering the long tail of online stores:
- shopify_product: hits the public /products/{handle}.json endpoint
that every Shopify store exposes. Undocumented but stable for 10+
years. Returns title, vendor, product_type, tags, full variants
array (price, SKU, stock, options), images, options matrix, and
the price_min/price_max/any_available summary fields. Covers the
~4M Shopify stores out there, modulo stores that put Cloudflare
in front of the shop. Rejects known non-Shopify hosts (amazon,
etsy, walmart, etc.) to save a failed request.
- ecommerce_product: generic Schema.org Product JSON-LD extractor.
Works on any modern store that ships the Google-required Product
rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace,
Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN,
images, normalized offers (Offer and AggregateOffer flattened into
one shape with price, currency, availability, condition),
aggregateRating, and the raw JSON-LD block for anyone who wants it.
Reuses webclaw_core::structured_data::extract_json_ld so the
JSON-LD parser stays shared across the extraction pipeline.
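The offer normalization described above can be sketched as follows. This is a minimal illustration, not the code in this commit: the type names (`SchemaOffer`, `NormalizedOffer`), the `flatten` function, and the choice of `f64` for prices are all assumptions, and the input is assumed to be already parsed out of the JSON-LD block.

```rust
/// Hypothetical input shapes mirroring Schema.org `Offer` and `AggregateOffer`,
/// assumed to be already parsed from JSON-LD.
pub enum SchemaOffer {
    Offer {
        price: f64,
        currency: String,
        availability: Option<String>,
        condition: Option<String>,
    },
    AggregateOffer {
        low_price: f64,
        currency: String,
        offers: Vec<SchemaOffer>,
    },
}

/// Hypothetical single output shape shared by both input types.
#[derive(Debug, PartialEq)]
pub struct NormalizedOffer {
    pub price: f64,
    pub currency: String,
    pub availability: Option<String>,
    pub condition: Option<String>,
}

/// Strip the Schema.org URL prefix: "https://schema.org/InStock" becomes "InStock".
fn short_name(raw: &str) -> String {
    raw.rsplit('/').next().unwrap_or(raw).to_string()
}

/// Flatten either offer type into a list of normalized offers.
pub fn flatten(offer: &SchemaOffer) -> Vec<NormalizedOffer> {
    match offer {
        SchemaOffer::Offer { price, currency, availability, condition } => {
            vec![NormalizedOffer {
                price: *price,
                currency: currency.clone(),
                availability: availability.as_deref().map(short_name),
                condition: condition.as_deref().map(short_name),
            }]
        }
        SchemaOffer::AggregateOffer { low_price, currency, offers } => {
            if offers.is_empty() {
                // No nested offers: fall back to the aggregate's low price.
                vec![NormalizedOffer {
                    price: *low_price,
                    currency: currency.clone(),
                    availability: None,
                    condition: None,
                }]
            } else {
                // Nested offers: flatten each one recursively.
                offers.iter().flat_map(flatten).collect()
            }
        }
    }
}
```

Callers then handle one shape regardless of which Schema.org type the store emitted.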
Both are explicit-call only — /v1/scrape/shopify_product and
/v1/scrape/ecommerce_product. Not in auto-dispatch because any
arbitrary /products/{slug} URL could belong to either platform
(or to a custom site that uses the same path shape), and claiming
such URLs blindly would steal from the default markdown /v1/scrape
flow.
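To make the path-shape ambiguity concrete, here is a hedged sketch of rewriting a product page URL to the /products/{handle}.json form. The function name `shopify_json_url` and the three-entry deny list are hypothetical, and the URL is split by hand only to keep the example dependency-free.

```rust
/// Illustrative subset of hosts known not to be Shopify stores (assumption).
const NON_SHOPIFY_HOSTS: [&str; 3] = ["amazon.com", "etsy.com", "walmart.com"];

/// Build the `/products/{handle}.json` URL for a product page URL.
/// Returns `None` when the host is on the deny list or the path has no
/// `/products/{handle}` segment.
pub fn shopify_json_url(page_url: &str) -> Option<String> {
    let (scheme, rest) = page_url.split_once("://")?;
    // Drop the query string and fragment.
    let rest = rest.split(|c| c == '?' || c == '#').next()?;
    let (host, path) = rest.split_once('/')?;

    // Compare hosts case-insensitively and without a leading "www.".
    let host_lc = host.to_ascii_lowercase();
    let bare = host_lc.strip_prefix("www.").unwrap_or(&host_lc);
    let denied = NON_SHOPIFY_HOSTS
        .iter()
        .any(|h| bare == *h || bare.ends_with(&format!(".{h}")));
    if denied {
        return None;
    }

    // Find the "products" segment and take the segment after it as the handle.
    let mut segments = path.split('/').filter(|s| !s.is_empty());
    while let Some(segment) = segments.next() {
        if segment == "products" {
            let handle = segments.next()?;
            let handle = handle.strip_suffix(".json").unwrap_or(handle);
            if handle.is_empty() {
                return None;
            }
            return Some(format!("{scheme}://{host}/products/{handle}.json"));
        }
    }
    None
}
```

The rewrite succeeds for any host outside the deny list that uses this path shape, Shopify or not. The path alone cannot identify the platform, which is the reason for keeping both extractors out of auto-dispatch.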
Live test results against real stores:
- shopify_product / Allbirds Tree Runners: $100, 7 size variants, 4 images,
Size option, all SKUs. 250ms.
- ecommerce_product / same Allbirds URL: ProductGroup schema, name
'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer.
300ms. Different extraction path, same product.
- ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand,
200ms.
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) return 403 as
expected. The error message points callers at the ecommerce_product
fallback, but Cloudflare also blocks the HTML path, so those stores
are cloud-tier territory.
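Seen from the caller's side, the fallback that the error message suggests might look like the sketch below. The `ScrapeError` type and `scrape_product` function are hypothetical, and the two extractors are passed in as closures so that the example needs no HTTP client.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ScrapeError {
    /// The upstream site refused the request (for example, a Cloudflare 403).
    Blocked { status: u16 },
    Other(String),
}

/// Try the Shopify extractor first; on a 403, retry with the generic
/// JSON-LD extractor. Any other outcome is returned unchanged.
pub fn scrape_product<F, G>(url: &str, shopify: F, generic: G) -> Result<String, ScrapeError>
where
    F: Fn(&str) -> Result<String, ScrapeError>,
    G: Fn(&str) -> Result<String, ScrapeError>,
{
    match shopify(url) {
        Err(ScrapeError::Blocked { status: 403 }) => generic(url),
        other => other,
    }
}
```

When Cloudflare blocks the HTML path as well, the second call fails the same way and the caller sees the second 403.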
Catalog now exposes 19 extractors via GET /v1/extractors. Unit
tests: 59 passing across the module.
Out of scope for v1:
- trustpilot_reviews: file written and tested (JSON-LD walker), but
NOT registered in the catalog or dispatch. Trustpilot's Cloudflare
Turnstile blocks our Firefox, Chrome, Safari, and mobile profiles
at the TLS layer. Shipping it would return 403 more often than 200.
Code kept in-tree under #[allow(dead_code)] for when the cloud
tier has residential-proxy support.
- Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF
story. Not fixable without real browser + proxy pool.
- WooCommerce explicit: most WooCommerce stores ship Product JSON-LD,
so ecommerce_product covers them. A dedicated WooCommerce REST
extractor (/wp-json/wc/store/products) would add little on top of
that and would work only on the ~30% of stores that expose the REST API.
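The "written but not registered" arrangement for trustpilot_reviews can be sketched as below. The module layout, the `NAME` constants, and the `catalog` function are hypothetical stand-ins, not the repository's actual registry.

```rust
/// Kept in-tree but not wired into the catalog, so the lint is silenced
/// for everything inside the module.
#[allow(dead_code)]
mod trustpilot_reviews {
    pub const NAME: &str = "trustpilot_reviews";

    /// Placeholder body; the real JSON-LD walker is out of scope here.
    pub fn extract(_html: &str) -> Option<String> {
        None
    }
}

mod shopify_product {
    pub const NAME: &str = "shopify_product";
}

mod ecommerce_product {
    pub const NAME: &str = "ecommerce_product";
}

/// Only registered extractors are listed; `trustpilot_reviews` is left
/// out on purpose.
pub fn catalog() -> Vec<&'static str> {
    vec![shopify_product::NAME, ecommerce_product::NAME]
}
```

Registering the extractor later would be a one-line change to `catalog`, after which the `#[allow(dead_code)]` attribute could be dropped.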
Wave 4 positioning: we now own the OSS structured-scrape space for
any site that respects Schema.org. That's Google's entire rich-result
index: meaningful territory competitors won't try to replicate as
named endpoints.
| Name |
|---|
| webclaw-cli |
| webclaw-core |
| webclaw-fetch |
| webclaw-llm |
| webclaw-mcp |
| webclaw-pdf |
| webclaw-server |