webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-04-25 00:06:21 +02:00

Valerio 0221c151dc feat(extractors): wave 4 - ecommerce (shopify + generic JSON-LD)

Two ecommerce extractors covering the long tail of online stores:

- shopify_product: hits the public /products/{handle}.json endpoint that every Shopify store exposes. Undocumented but stable for 10+ years. Returns title, vendor, product_type, tags, full variants array (price, SKU, stock, options), images, options matrix, and the price_min/price_max/any_available summary fields. Covers the ~4M Shopify stores out there, modulo stores that put Cloudflare in front of the shop. Rejects known non-Shopify hosts (amazon, etsy, walmart, etc.) to save a failed request.
- ecommerce_product: generic Schema.org Product JSON-LD extractor. Works on any modern store that ships the Google-required Product rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace, Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN, images, normalized offers (Offer and AggregateOffer flattened into one shape with price, currency, availability, condition), aggregateRating, and the raw JSON-LD block for anyone who wants it. Reuses webclaw_core::structured_data::extract_json_ld so the JSON-LD parser stays shared across the extraction pipeline.

Both are explicit-call only: /v1/scrape/shopify_product and /v1/scrape/ecommerce_product. Not in auto-dispatch because any arbitrary /products/{slug} URL could belong to either platform (or to a custom site that uses the same path shape), and claiming such URLs blindly would steal from the default markdown /v1/scrape flow.

Live test results against real stores:

- Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images, Size option, all SKUs. 250ms.
- ecommerce_product / same Allbirds URL: ProductGroup schema, name 'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer. 300ms. Different extraction path, same product.
- ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand, 200ms.
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as expected; the error message points callers at the ecommerce_product fallback, but Cloudflare also blocks the HTML path so those stores are cloud-tier territory.

Catalog now exposes 19 extractors via GET /v1/extractors. Unit tests: 59 passing across the module.

Scope not in v1:

- trustpilot_reviews: file written and tested (JSON-LD walker), but NOT registered in the catalog or dispatch. Trustpilot's Cloudflare turnstile blocks our Firefox + Chrome + Safari + mobile profiles at the TLS layer. Shipping it would return 403 more often than 200. Code kept in-tree under #[allow(dead_code)] for when the cloud tier has residential-proxy support.
- Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF story. Not fixable without real browser + proxy pool.
- WooCommerce explicit: most WooCommerce stores ship Product JSON-LD, so ecommerce_product covers them. A dedicated WooCommerce REST extractor (/wp-json/wc/store/products) would be marginal on top of that and only works on ~30% of stores that expose the REST API.

Wave 4 positioning: we now own the OSS structured-scrape space for any site that respects Schema.org. That's Google's entire rich-result index: meaningful territory competitors won't try to replicate as named endpoints.

2026-04-22 15:36:01 +02:00
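As a rough illustration of the shopify_product behavior described in the commit message (derive the /products/{handle}.json endpoint from a product page URL, and reject known non-Shopify hosts before making a request), here is a minimal std-only sketch. It is not the code from webclaw-fetch: the function name `shopify_json_url`, the constant `NON_SHOPIFY_HOSTS`, and the three-entry host list are assumptions made for the example, and the real extractor presumably uses a proper URL parser and a longer list.

```rust
/// Hosts the commit message names as known non-Shopify.
/// Assumption: the real list is longer; only these three are named in the text.
const NON_SHOPIFY_HOSTS: [&str; 3] = ["amazon", "etsy", "walmart"];

/// Hypothetical helper: map a product page URL to the public
/// `/products/{handle}.json` endpoint. Returns `None` when the host is on
/// the reject list or the path has no `/products/{handle}` segment.
fn shopify_json_url(page_url: &str) -> Option<String> {
    // Split "scheme://host/path" by hand to stay dependency-free.
    let (scheme, rest) = page_url.split_once("://")?;
    let (host, path) = rest.split_once('/')?;

    // Reject known non-Shopify hosts by comparing each DNS label,
    // so "www.amazon.com" is rejected but "notamazon.com" is not.
    let host_lc = host.to_ascii_lowercase();
    if host_lc
        .split('.')
        .any(|label| NON_SHOPIFY_HOSTS.contains(&label))
    {
        return None;
    }

    // Drop any query string or fragment before inspecting path segments.
    let path = path
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");

    // Accept both /products/{handle} and /collections/x/products/{handle}.
    let mut segments = path.split('/').filter(|s| !s.is_empty());
    while let Some(segment) = segments.next() {
        if segment == "products" {
            let handle = segments.next()?;
            // Tolerate URLs that already end in ".json".
            let handle = handle.strip_suffix(".json").unwrap_or(handle);
            if handle.is_empty() {
                return None;
            }
            return Some(format!("{}://{}/products/{}.json", scheme, host, handle));
        }
    }
    None
}
```

Called with a placeholder URL such as `https://example-store.com/products/tree-runners?variant=1`, the helper yields the same URL with the query removed and `.json` appended to the handle. Returning `None` early for rejected hosts mirrors the stated goal of saving a failed request.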
..
webclaw-cli	feat(cli): add webclaw bench <url> subcommand (closes #26)	2026-04-22 12:25:29 +02:00
webclaw-core	style: cargo fmt	2026-04-17 12:03:22 +02:00
webclaw-fetch	feat(extractors): wave 4 - ecommerce (shopify + generic JSON-LD)	2026-04-22 15:36:01 +02:00
webclaw-llm	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)	2026-04-16 19:44:08 +02:00
webclaw-mcp	fix(mcp): silence dead-code warning on tool_router field (closes #30)	2026-04-22 12:25:39 +02:00
webclaw-pdf	Initial release: webclaw v0.1.0 — web content extraction for LLMs	2026-03-23 18:31:11 +01:00
webclaw-server	fix(server): switch default browser profile to Firefox	2026-04-22 14:11:55 +02:00