webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-08 22:25:12 +02:00

Author	SHA1	Message	Date
Valerio	a1abf625a0	build(deps): pin wreq/wreq-util to exact rc versions wreq is a release candidate with no API stability between rc.N builds (rc.29 broke the TLS + Response API). `cargo install` and the release workflow both ignore Cargo.lock and were re-resolving to rc.29, breaking the build. An exact `=6.0.0-rc.28` / `=3.0.0-rc.10` pin keeps every build path deterministic until wreq reaches a stable release.	2026-06-04 19:33:31 +02:00
Valerio	217bfe088b	feat(reddit): parse old.reddit.com HTML instead of the dead .json API Reddit blocked unauthenticated `.json` access, so the previous extractor returned block pages or timed out on every thread. Switch to parsing old.reddit.com's server-rendered HTML, which needs no API key or JS. Fetch layer: - Rewrite every Reddit host to old.reddit.com before fetching; drop all `.json` URL handling and the JSON response parser. Extraction (webclaw-core::reddit): - New HTML parser producing a typed post + nested comment tree. - Comments nest structurally (.comment > .child > .sitetable > .comment); old.reddit omits a usable depth attribute, so the tree is walked recursively. Bodies live in .entry > form > .usertext-body > .md. - Post metadata: title, author, subreddit, score, comment count (data-comments-count), self-vs-link (self class / self.* domain), flair, self-text body. - Comment scores read the .score.unvoted title (the displayed value, not the ±1 vote-state siblings); hidden scores are None, not 0. - Deleted comments are kept in place so their replies aren't orphaned; "load more comments" stubs are skipped. Markdown output: - Reply nesting via blockquote depth (avoids 4-space indentation turning text and code fences into broken indented-code blocks). - Links keep their target as [text](url); root-relative reddit links resolve against old.reddit.com. Nested lists indent correctly. - A recognised but unparseable /comments/ page returns no content rather than falling through to generic extraction of Reddit chrome. Tests: regression suite runs against real old.reddit.com fixtures (testdata/reddit/), the ground truth that surfaced the parsing and markdown bugs synthetic HTML had hidden. Fixtures are excluded from the published crate.	2026-06-04 17:36:02 +02:00
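The blockquote-depth nesting described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `webclaw-core::reddit`: the `Comment` struct, its field names, and the `render` function are hypothetical.

```rust
/// Hypothetical comment node; the real typed tree in webclaw-core may differ.
struct Comment {
    author: String,
    body: String,
    replies: Vec<Comment>,
}

/// Render a comment and its replies, prefixing every line with one `> ` per
/// nesting level. Because nothing is indented by four spaces, reply text and
/// code fences are not reinterpreted as indented-code blocks.
fn render(comment: &Comment, depth: usize, out: &mut String) {
    let prefix = "> ".repeat(depth);
    out.push_str(&format!("{}**{}**\n", prefix, comment.author));
    for line in comment.body.lines() {
        out.push_str(&format!("{}{}\n", prefix, line));
    }
    out.push('\n');
    for reply in &comment.replies {
        render(reply, depth + 1, out);
    }
}
```

Calling `render(&root, 0, &mut out)` on a top-level comment walks the tree recursively, which mirrors the structural nesting the commit describes for old.reddit.com.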
Valerio	fe567a6af1	feat(core): endpoints module for API surface extraction from HTML and JS (#47) * feat(core): endpoints module — extract API surface from HTML + JS bundles * fix(docker): source CA bundle from distroless instead of apt (fixes arm64 release build) * fix(test): serialize env-mutating CloudClient tests to stop flaky CI * feat(core): filter endpoint-extractor noise (invalid hosts, schema domains, bare paths)	2026-05-19 19:05:16 +02:00
Valerio	be8bcfebd9	fix: harden resource limits, path safety, and WASM build (#46) Security audit follow-up across the workspace: - webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a cfg(not(wasm32)) target dependency and the extraction entry point uses a direct call on wasm instead of spawning a thread, so it builds and runs on wasm32 with or without default features. - webclaw-core: bound the structured-data scrubber recursion (depth cap) so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the stack. - webclaw-fetch: stream the response body with a running ceiling so a small highly compressed payload cannot inflate to gigabytes in memory; redact user:pass@ from proxy URLs before they reach error strings. - webclaw-cli: contain output filenames inside the chosen directory (reject .. / absolute, drop traversal path segments), run --webhook URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s, and make research slug truncation char-safe. - webclaw-mcp: char-safe slug truncation (no multibyte slice panic). - setup.sh / deploy/hetzner.sh: replace eval on read input with printf -v, and mask auth key / API token in console output. - CI: enforce the wasm32 build invariant for webclaw-core. Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.	2026-05-19 17:03:52 +02:00
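The "char-safe slug truncation" item refers to a common Rust pitfall: slicing a `&str` at a byte offset that falls inside a multibyte character panics. A minimal sketch of one way to avoid it is shown below; the function name `truncate_chars` is an assumption for illustration, not the helper used in webclaw-cli or webclaw-mcp.

```rust
/// Truncate `s` to at most `max_chars` characters without ever splitting a
/// multibyte UTF-8 sequence. `char_indices` only yields valid character
/// boundaries, so the resulting slice cannot panic.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // `byte_idx` is the start of the first character we want to drop.
        Some((byte_idx, _)) => &s[..byte_idx],
        // The string is already short enough.
        None => s,
    }
}
```

Counting characters rather than bytes means the limit is expressed in characters, so the returned slice can be longer than `max_chars` bytes.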
Valerio	fd2e75d509	chore(fetch): satisfy clippy for resolver setup	2026-05-12 12:09:18 +02:00
Valerio	307b4f980d	fix(extractors): harden marketplace host matching	2026-05-12 12:03:43 +02:00
Valerio	3bcb288d13	fix(fetch): guard challenge detection before utf8 decoding	2026-05-12 12:00:47 +02:00
Valerio	a611ae26f3	fix(security): harden local fetch surfaces	2026-05-12 12:00:25 +02:00
Valerio	bdf81fe6bf	fix: harden fetch URL validation	2026-05-04 11:50:57 +02:00
Valerio	966981bc42	fix(fetch): send bot-identifying UA on reddit .json API to bypass browser UA block	2026-04-23 15:17:04 +02:00
Valerio	866fa88aa0	fix(fetch): reject HTML verification pages served at .json reddit URL	2026-04-23 15:06:35 +02:00
Valerio	b413d702b2	feat(fetch): add fetch_smart with Reddit + Akamai rescue paths, bump 0.5.6	2026-04-23 14:59:29 +02:00
Valerio	b77767814a	Bump to 0.5.4: SafariIos profile + Chrome fingerprint alignment + locale helper - New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26. Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers, gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3 8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on immobiliare.it with country-it residential. - BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256, explicit extension_permutation, advertise h3 in ALPN and ALPS. JA3 43067709b025da334de1279a120f8e14, akamai_fp 52d84b11737d980aef856699f885ca86. Fixes indeed.com and other Cloudflare-fronted sites. - New locale module: accept_language_for_url / accept_language_for_tld. TLD to Accept-Language mapping, unknown TLDs default to en-US. DataDome geo-vs-locale cross-checks are now trivially satisfiable. - wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.	2026-04-23 12:58:24 +02:00
Valerio	058493bc8f	feat(fetch): Fetcher trait so vertical extractors work under any HTTP backend Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now take `client: &dyn Fetcher` instead of `client: &FetchClient` directly. Backwards-compatible: FetchClient implements Fetcher, blanket impls cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server callers keep working unchanged. Motivation: the production API server (api.webclaw.io) must not do in-process TLS fingerprinting; it delegates all HTTP to the Go tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on production would have required importing wreq into the server's dep graph, violating the CLAUDE.md rule. Now production can provide its own TlsSidecarFetcher implementation and pass it to the same dispatcher the OSS server uses. Changes: - New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus blanket impls for `&T` and `Arc<T>`. - `FetchClient` gains a tiny impl block in client.rs that forwards to its existing public methods. - All 28 extractor signatures migrated from `&FetchClient` to `&dyn Fetcher` (sed-driven bulk rewrite, no semantic change). - `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`. - `extractors::dispatch_by_url` and `extractors::dispatch_by_name` take `&dyn Fetcher`. - `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has native async-fn-in-trait but dyn dispatch still needs async_trait). - Version bumped to 0.5.1, CHANGELOG updated. Tests: 215 passing in webclaw-fetch (no new tests needed — the existing extractor tests exercise the trait methods transparently). Clippy: clean workspace-wide.	2026-04-22 21:17:50 +02:00
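The trait-plus-blanket-impl pattern from this commit can be sketched as below. This is a simplified, synchronous stand-in: the real `webclaw_fetch::Fetcher` is async and uses `async-trait`, and the `fetch` method signature, `StaticFetcher`, and `extract_len` are all hypothetical names used only for illustration.

```rust
use std::sync::Arc;

/// Simplified synchronous stand-in for the trait (assumption: the real
/// trait is async and has a richer method set).
trait Fetcher {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

// Blanket impls: anything that holds a Fetcher by reference or by Arc is
// itself a Fetcher, so existing callers keep working unchanged.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

/// Hypothetical backend used only in this example.
struct StaticFetcher;

impl Fetcher for StaticFetcher {
    fn fetch(&self, url: &str) -> Result<String, String> {
        Ok(format!("body of {}", url))
    }
}

/// An extractor depends on the trait object, not on a concrete client.
fn extract_len(client: &dyn Fetcher, url: &str) -> Result<usize, String> {
    let body = client.fetch(url)?;
    Ok(body.len())
}
```

With this shape, a second backend (such as the sidecar-based fetcher the commit mentions) only needs its own `impl Fetcher` to be passed to the same extractor functions.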
Valerio	b2e7dbf365	fix(extractors): perfect-score follow-ups (trustpilot 2025 schema, amazon/etsy fallbacks, cloud docs) Addresses the four follow-ups surfaced by the cloud-key smoke test. trustpilot_reviews — full rewrite for 2025 schema: - Trustpilot moved from single-Organization+aggregateRating to three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table mainEntity carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent review objects. - Parser now: skips the site-level Org, walks @graph as either array or single object, picks the Dataset whose about.@id references the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and turns aiSummaryReviews into a clean reviews array with author/country/date/rating/title/text/likes. - OG-title regex fallbacks for business_name, rating_label, and average_rating when the Dataset block is absent. OG-description regex for review_count. - Returned shape: url, domain, business_name, rating_label, average_rating, review_count, rating_distribution (per-star count and percent), ai_summary, recent_reviews, review_count_listed, data_source. - Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 / 226 reviews with full distribution + AI summary + 2 recent reviews. amazon_product — force-cloud-escalation + OG fallback: - Amazon serves Product JSON-LD intermittently even on non-CAPTCHA pages. When local fetch returns HTML without Product JSON-LD and a cloud client is configured, force-escalate to the cloud path which reliably surfaces title + description via its render engine. - New OG meta-tag fallback for title/image/description so the cloud's synthesize_html output (OG tags only, no #productTitle DOM ID) still yields useful data. Real Amazon pages still prefer the DOM regex. 
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple MacBook Pro title + description + asin. etsy_listing — slug humanization + generic-page filtering + shop from brand: - Etsy serves various placeholder pages when a listing is delisted, blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...", "This item is unavailable - Etsy", plus the OG description "Sorry, the page you were looking for was not found." is_generic_* helpers catch all three shapes. - When the OG title is generic, humanise the URL slug: the path `/listing/123456789/personalized-stainless-steel-tumbler` becomes `Personalized Stainless Steel Tumbler` so callers always get a meaningful title even on dead listings. - Etsy uses `brand` (top-level JSON-LD field) for the shop name on listings that don't ship offers[].seller.name. Shop now falls through offers -> brand so either schema resolves. - Verified live: listing/1097462299 returns full rich data (title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating / 225 reviews, InStock). cloud.rs — module doc update: - Added an architecture section documenting that api.webclaw.io does not return raw HTML by design and that [`synthesize_html`] reassembles the parsed response (metadata + structured_data + markdown) back into minimal HTML so existing local parsers run unchanged across both paths. Also notes the DOM-regex limitation for extractors that need live-page-specific DOM IDs. Tests: 215 passing in webclaw-fetch (18 new), clippy clean. Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY: 28/28 clean, 0 partial, 0 failed.	2026-04-22 17:49:50 +02:00
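The slug humanization for dead Etsy listings can be sketched roughly like this. The function name `humanize_listing_slug` and the exact handling of edge cases are assumptions; only the input/output example comes from the commit message.

```rust
/// Turn the slug of a `/listing/{id}/{slug}` path into a title-cased string.
/// Returns `None` when the path is not a listing path or has no slug.
fn humanize_listing_slug(path: &str) -> Option<String> {
    let mut parts = path.trim_matches('/').split('/');
    if parts.next()? != "listing" {
        return None;
    }
    let _id = parts.next()?;
    let slug = parts.next()?;
    let words: Vec<String> = slug
        .split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            // Uppercase the first character and keep the rest as-is.
            let mut chars = w.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect();
    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}
```

For the path quoted in the commit, `/listing/123456789/personalized-stainless-steel-tumbler`, this yields `Personalized Stainless Steel Tumbler`.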
Valerio	e10066f527	fix(cloud): synthesize HTML from cloud response instead of requesting raw html api.webclaw.io/v1/scrape does not return a `html` field even when `formats=["html"]` is requested, by design: the cloud API returns pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags, title, description, image, site_name), and `markdown`. Our CloudClient::fetch_html helper was premised on the API returning raw HTML. Without a key set, the error message was hidden behind CloudError::NotConfigured so the bug never surfaced. With a key set, every extractor that escalated to cloud (trustpilot_reviews, etsy_listing, amazon_product, ebay_listing, substack_post HTML fallback) got back "cloud /v1/scrape returned no html field". Fix: reassemble a minimal synthetic HTML document from the cloud's parsed output. Each JSON-LD block goes back into a `<script type="application/ld+json">` tag, metadata fields become OG `<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing local extractor parsers (find_product_jsonld, find_business, og() regex) see the same shapes they'd see from a real page, so no per-extractor changes needed. Verified end-to-end with WEBCLAW_CLOUD_API_KEY set: - trustpilot_reviews: escalates, returns Organization JSON-LD data (parser picks Trustpilot site-level Org not the reviewed business; tracked as a follow-up to update Trustpilot schema handling) - etsy_listing: escalates via antibot render path; listing-specific data depends on target listing having JSON-LD (many Etsy listings don't) - amazon_product, ebay_listing: stay local because their pages ship enough content not to trigger bot-detection escalation - The other 24 extractors unchanged (local path, zero cloud credits) Tests: 200 passing in webclaw-fetch (3 new), clippy clean.	2026-04-22 17:24:50 +02:00
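The reassembly step described here (JSON-LD back into script tags, metadata into OG meta tags, markdown into a `<pre>`) might look something like the sketch below. The signature is an assumption made for a self-contained example; the real `synthesize_html` in `cloud.rs` works on the parsed cloud response and may differ in details such as escaping.

```rust
/// Escape text for use in HTML attribute values and element bodies.
fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Rebuild a minimal HTML document from already-parsed pieces.
/// `json_ld` holds raw JSON-LD strings; `meta` holds (name, content) pairs
/// that are emitted as `og:{name}` tags (hypothetical input shape).
fn synthesize_html(json_ld: &[&str], meta: &[(&str, &str)], markdown: &str) -> String {
    let mut html = String::from("<html><head>\n");
    for (name, content) in meta {
        html.push_str(&format!(
            "<meta property=\"og:{}\" content=\"{}\">\n",
            escape_html(name),
            escape_html(content)
        ));
    }
    for block in json_ld {
        // Keep a literal "</script" inside the JSON from closing the tag
        // early; "<\/" is still valid JSON.
        let safe = block.replace("</", "<\\/");
        html.push_str(&format!(
            "<script type=\"application/ld+json\">{}</script>\n",
            safe
        ));
    }
    html.push_str("</head><body>\n<pre>");
    html.push_str(&escape_html(markdown));
    html.push_str("</pre>\n</body></html>\n");
    html
}
```

The point of the design is that parsers written against real pages (JSON-LD walkers, OG regexes) see the same shapes in the synthetic document, so they need no per-extractor changes.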
Valerio	a53578e45c	fix(extractors): detect AWS WAF verifying-connection page, add OG fallback to ecommerce_product Two targeted fixes surfaced by the manual extractor smoke test. cloud::is_bot_protected: - Trustpilot serves a ~565-byte AWS WAF interstitial with the string "Verifying your connection..." and an `interstitial-spinner` div. That pattern was not in our detector, so local fetch returned the challenge page, JSON-LD parsing found nothing, and the extractor emitted a confusing "no Organization/LocalBusiness JSON-LD" error. - Added the pattern plus a <10KB size gate so real articles that happen to mention the phrase aren't misclassified. Two new tests cover positive + negative cases. - With the fix, trustpilot_reviews now correctly escalates via smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY" actionable error without a key, or cloud-bypassed HTML with one. ecommerce_product: - Previously hard-failed when a page had no Product JSON-LD, and produced an empty `offers` list when JSON-LD was present but its `offers` node was empty. Many sites (Patagonia-style catalog pages, smaller Squarespace stores) ship one or the other of OG / JSON-LD but not both with price data. - Added OG meta-tag fallback that handles: * no JSON-LD at all -> build minimal payload from og:title, og:image, og:description, product:price:amount, product:price:currency, product:availability, product:brand * JSON-LD present but offers empty -> augment with an OG-derived offer so price comes through - New `data_source` field: "jsonld", "jsonld+og", or "og_fallback" so callers can tell which branch populated the data. - `has_og_product_signal()` requires og:type=product or a price tag so blog posts don't get mis-classified as products. Tests: 197 passing in webclaw-fetch (6 new), clippy clean.	2026-04-22 17:07:31 +02:00
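The size-gated detection described for `cloud::is_bot_protected` can be illustrated with a small sketch. The function name and the choice to accept either marker on its own are assumptions; the commit only states that the pattern was added together with a <10KB gate.

```rust
/// Size gate: only small documents are considered possible interstitials.
const MAX_INTERSTITIAL_BYTES: usize = 10 * 1024;

/// Hypothetical detector for the AWS WAF "verifying connection" page.
/// A long article that merely quotes the phrase fails the size gate and is
/// therefore not classified as a challenge page.
fn looks_like_waf_interstitial(html: &str) -> bool {
    if html.len() >= MAX_INTERSTITIAL_BYTES {
        return false;
    }
    html.contains("Verifying your connection") || html.contains("interstitial-spinner")
}
```

The gate trades a small risk of missing an unusually large challenge page for far fewer false positives on real content.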
Valerio	7f5eb93b65	feat(extractors): wave 6b, etsy_listing + HTML fallbacks for substack/youtube Adds etsy_listing and hardens two existing extractors with HTML fallbacks so transient API failures still return useful data. New: - etsy_listing: /listing/{id}(/slug) with Schema.org Product JSON-LD + OG fallback. Antibot-gated, routes through cloud::smart_fetch_html like amazon_product and ebay_listing. Auto-dispatched (etsy host is unique). Hardened: - substack_post: when /api/v1/posts/{slug} returns non-200 (rate limit, 403 on hardened custom domains, 5xx), fall back to HTML fetch and parse OG tags + Article JSON-LD. Response shape is stable across both paths, with a `data_source` field of "api" or "html_fallback". - youtube_video: when ytInitialPlayerResponse is missing (EU-consent interstitial, age-gated, some live pre-shows), fall back to OG tags for title/description/thumbnail. `data_source` now "player_response" or "og_fallback". Tests: 91 passing in webclaw-fetch (9 new), clippy clean.	2026-04-22 16:44:51 +02:00
Valerio	8cc727c2f2	feat(extractors): wave 6a, 5 easy verticals (27 total) Adds 5 structured extractors that hit public APIs with stable shapes: - github_issue: /repos/{o}/{r}/issues/{n} (rejects PRs, points to github_pr) - shopify_collection: /collections/{handle}.json + products.json - woocommerce_product: /wp-json/wc/store/v1/products?slug={slug} - substack_post: /api/v1/posts/{slug} (works on custom domains too) - youtube_video: ytInitialPlayerResponse blob from /watch HTML Auto-dispatched: github_issue, youtube_video (unique hosts and stable URL shapes). Explicit-call: shopify_collection, woocommerce_product, substack_post (URL shapes overlap with non-target sites). Tests: 82 total passing in webclaw-fetch (12 new), clippy clean.	2026-04-22 16:33:35 +02:00
Valerio	d8c9274a9c	feat(extractors): wave 5 — Amazon, eBay, Trustpilot via cloud fallback Three hard-site extractors that all require antibot bypass to ever return usable data. They ship in OSS so the parsers + schema live with the rest of the vertical extractors, but the fetch path routes through cloud::smart_fetch_html — meaning: - With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on challenge-page detection we escalate to api.webclaw.io/v1/scrape with formats=['html'] and parse the antibot-bypassed HTML locally. - Without a cloud key, callers get a typed CloudError::NotConfigured whose Display message points at https://webclaw.io/signup. Self-hosters without a webclaw.io account know exactly what to do. ## New extractors (all auto-dispatched — unique hosts) - amazon_product: ASIN extraction from /dp/, /gp/product/, /product/, /exec/obidos/ASIN/ URL shapes across every amazon.* locale. Parses the Product JSON-LD Amazon ships for SEO; falls back to #productTitle and #landingImage DOM selectors when JSON-LD is absent. Returns price, currency, availability, condition, brand, image, aggregate rating, SKU / MPN. - ebay_listing: item-id extraction from /itm/{id} and /itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr / .it. Parses both bare Offer (Buy It Now) and AggregateOffer (used-copies / auctions) from the Product JSON-LD. Returns price or low/high-price range, currency, condition, seller, offer_count, aggregate rating. - trustpilot_reviews: reactivated from the `trustpilot_reviews` file that was previously dead-code'd. Parser already worked; it just needed the smart_fetch_html path to get past AWS WAF's 'Verifying Connection' interstitial. Organisation / LocalBusiness JSON-LD block gives aggregate rating + up to 20 recent reviews. 
## FetchClient change - Added optional `cloud: Option<Arc<CloudClient>>` field with `FetchClient::with_cloud(cloud) -> Self` builder + `cloud(&self)` accessor. Extractors call client.cloud() to decide whether they can escalate. Cheap clones (Arc-wrapped). ## webclaw-server wiring AppState::new() now reads the cloud credential from env: 1. WEBCLAW_CLOUD_API_KEY — preferred, disambiguates from the server's own inbound bearer token. 2. WEBCLAW_API_KEY — fallback only when the server is in open mode (no inbound-auth key set), matching the MCP / CLI convention of that env var. When present, state.rs builds a CloudClient and attaches it to the FetchClient via with_cloud(). Log line at startup so operators see when cloud fallback is active. ## Catalog + dispatch All three extractors registered in list() and in dispatch_by_url. /v1/extractors catalog now exposes 22 verticals. Explicit /v1/scrape/{vertical} routes work per the existing pattern. ## Tests - 7 new unit tests (parse_asin multi-shape + parse from JSON-LD fixture + DOM-fallback on missing JSON-LD for Amazon; ebay URL-matching + slugged-URL parsing + both Offer and AggregateOffer fixtures). - Full extractors suite: 68 passing (was 59, +9 from the new files). - fmt + clippy clean. - No live-test story for these three inside CI — verifying them means having WEBCLAW_CLOUD_API_KEY set against a real cloud backend. Integration-test harness is a separate follow-up. Catalog summary: 22 verticals total across wave 1-5. Hard-site three are gated behind an actionable cloud-fallback upgrade path rather than silently returning nothing or 403-ing the caller.	2026-04-22 16:16:11 +02:00
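The credential precedence described under "webclaw-server wiring" can be sketched as a pure function. The name `resolve_cloud_key`, the `inbound_auth_set` flag, and the injected lookup closure are assumptions made so the logic is testable without touching the process environment; the real `AppState::new()` reads the environment directly.

```rust
/// Pick the cloud credential from an env-like lookup.
/// `inbound_auth_set` is true when the server protects itself with its own
/// inbound bearer key (i.e. it is not in open mode).
fn resolve_cloud_key<F>(lookup: F, inbound_auth_set: bool) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat empty or whitespace-only values as unset.
    let non_empty = |name: &str| lookup(name).filter(|v| !v.trim().is_empty());

    // 1. The dedicated variable always wins.
    if let Some(key) = non_empty("WEBCLAW_CLOUD_API_KEY") {
        return Some(key);
    }
    // 2. The shared variable is only trusted in open mode; otherwise it is
    //    the server's own inbound token and must not be sent upstream.
    if !inbound_auth_set {
        return non_empty("WEBCLAW_API_KEY");
    }
    None
}
```

In a server this would be called as `resolve_cloud_key(|name| std::env::var(name).ok(), inbound_auth_set)`.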
Valerio	0ab891bd6b	refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch The local-first / cloud-fallback flow was duplicated in two places: - webclaw-mcp/src/cloud.rs (302 lines, canonical) - webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid pulling rmcp as a dep) Move to the shared crate where all vertical extractors and the new webclaw-server can also reach it. ## New module: webclaw-fetch/src/cloud.rs Single canonical home. Consolidates both previous versions and promotes the error type from stringy to typed: - `CloudError` enum with dedicated variants for the four HTTP outcomes callers act on differently — 401 (key rejected), 402 (insufficient plan), 429 (rate limited), plus ServerError / Network / ParseFailed. Each variant's Display message ends with an actionable URL (signup / pricing / dashboard) so API consumers can surface it verbatim. - `From<CloudError> for String` bridge so the dozen existing `.await?` call sites in MCP / CLI that expected `Result<_, String>` keep compiling. We can migrate them to the typed error per-site later without a churn commit. - `CloudClient::new(Option<&str>)` matches the CLI's `--api-key` flag pattern (explicit key wins, env fallback, None when empty). `::from_env()` kept for MCP-style call sites. - `with_key_and_base` for staging / integration tests. - `scrape / post / get / fetch_html` — `fetch_html` is new, a convenience that calls /v1/scrape with formats=["html"] and returns the raw HTML string so vertical extractors can plug antibot-bypassed HTML straight into their parsers. - `is_bot_protected` + `needs_js_rendering` detectors moved over verbatim. Detection patterns are public (CF / DataDome / AWS WAF challenge-page signatures) — no moat leak. - `smart_fetch` kept on the original `Result<_, String>` signature so MCP's six call sites compile unchanged. 
- `smart_fetch_html` is new: the local-first-then-cloud flow for the vertical-extractor pattern, returning the typed `CloudError` so extractors can emit precise upgrade-path messages. ## Cleanup - Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to `webclaw_fetch::cloud::*`. Dropped reqwest as a direct dep of webclaw-mcp (it only used it for the old cloud client). - Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its webhook / on-change / research HTTP calls. - webclaw-fetch now has reqwest as a direct dep. It was already transitively pulled in by webclaw-llm; this just makes the dependency relationship explicit at the call site. ## Tests 16 new unit tests cover: - CloudError status mapping (401/402/429/5xx) - NotConfigured error includes signup URL - CloudClient::new explicit-key-wins-over-env + empty-string = None - base_url strips trailing slash - Detector matrix (CF challenge / Turnstile / real content with embedded Turnstile / SPA skeleton / real article with script tags) - truncate respects char boundaries (don't slice inside UTF-8) Full workspace test suite still passes (~500 tests). fmt + clippy clean. No behavior change for existing MCP / CLI call sites.	2026-04-22 16:05:44 +02:00
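The typed error plus `From<CloudError> for String` bridge can be sketched as follows. Variant names other than `NotConfigured` and `ServerError`, the message wording, and the helper functions are assumptions; the commit only lists the status codes and says each message ends with an actionable URL, of which only the signup URL appears in this log.

```rust
use std::fmt;

/// Hypothetical subset of the error type (Network / ParseFailed omitted).
#[derive(Debug, PartialEq)]
enum CloudError {
    NotConfigured,
    Unauthorized,     // 401: key rejected
    InsufficientPlan, // 402
    RateLimited,      // 429
    ServerError(u16), // 5xx
}

impl fmt::Display for CloudError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CloudError::NotConfigured => {
                write!(f, "cloud key not configured, see https://webclaw.io/signup")
            }
            CloudError::Unauthorized => write!(f, "key rejected (401)"),
            CloudError::InsufficientPlan => write!(f, "insufficient plan (402)"),
            CloudError::RateLimited => write!(f, "rate limited (429)"),
            CloudError::ServerError(code) => write!(f, "server error ({})", code),
        }
    }
}

/// Bridge so call sites returning `Result<_, String>` can keep using `?`.
impl From<CloudError> for String {
    fn from(err: CloudError) -> String {
        err.to_string()
    }
}

/// Map an HTTP status to the error a caller would act on, if any.
fn error_for_status(status: u16) -> Option<CloudError> {
    match status {
        401 => Some(CloudError::Unauthorized),
        402 => Some(CloudError::InsufficientPlan),
        429 => Some(CloudError::RateLimited),
        500..=599 => Some(CloudError::ServerError(status)),
        _ => None,
    }
}

fn check(status: u16) -> Result<(), CloudError> {
    error_for_status(status).map_or(Ok(()), Err)
}

/// A "legacy" stringly-typed caller: `?` converts through the From impl.
fn legacy_call(status: u16) -> Result<&'static str, String> {
    check(status)?;
    Ok("ok")
}
```

The bridge is what lets typed and stringly-typed call sites coexist while callers are migrated one at a time.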
Valerio	0221c151dc	feat(extractors): wave 4 — ecommerce (shopify + generic JSON-LD) Two ecommerce extractors covering the long tail of online stores: - shopify_product: hits the public /products/{handle}.json endpoint that every Shopify store exposes. Undocumented but stable for 10+ years. Returns title, vendor, product_type, tags, full variants array (price, SKU, stock, options), images, options matrix, and the price_min/price_max/any_available summary fields. Covers the ~4M Shopify stores out there, modulo stores that put Cloudflare in front of the shop. Rejects known non-Shopify hosts (amazon, etsy, walmart, etc.) to save a failed request. - ecommerce_product: generic Schema.org Product JSON-LD extractor. Works on any modern store that ships the Google-required Product rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace, Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN, images, normalized offers (Offer and AggregateOffer flattened into one shape with price, currency, availability, condition), aggregateRating, and the raw JSON-LD block for anyone who wants it. Reuses webclaw_core::structured_data::extract_json_ld so the JSON-LD parser stays shared across the extraction pipeline. Both are explicit-call only — /v1/scrape/shopify_product and /v1/scrape/ecommerce_product. Not in auto-dispatch because any arbitrary /products/{slug} URL could belong to either platform (or to a custom site that uses the same path shape), and claiming such URLs blindly would steal from the default markdown /v1/scrape flow. Live test results against real stores: - Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images, Size option, all SKUs. 250ms. - ecommerce_product / same Allbirds URL: ProductGroup schema, name 'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer. 300ms. Different extraction path, same product. - ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand, 200ms. 
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as expected — the error message points callers at the ecommerce_product fallback, but Cloudflare also blocks the HTML path so those stores are cloud-tier territory. Catalog now exposes 19 extractors via GET /v1/extractors. Unit tests: 59 passing across the module. Scope not in v1: - trustpilot_reviews: file written and tested (JSON-LD walker), but NOT registered in the catalog or dispatch. Trustpilot's Cloudflare turnstile blocks our Firefox + Chrome + Safari + mobile profiles at the TLS layer. Shipping it would return 403 more often than 200. Code kept in-tree under #[allow(dead_code)] for when the cloud tier has residential-proxy support. - Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF story. Not fixable without real browser + proxy pool. - WooCommerce explicit: most WooCommerce stores ship Product JSON-LD, so ecommerce_product covers them. A dedicated WooCommerce REST extractor (/wp-json/wc/store/products) would be marginal on top of that and only works on ~30% of stores that expose the REST API. Wave 4 positioning: we now own the OSS structured-scrape space for any site that respects Schema.org. That's Google's entire rich-result index — meaningful territory competitors won't try to replicate as named endpoints.	2026-04-22 15:36:01 +02:00
Valerio	3bb0a4bca0	feat(extractors): add LinkedIn + Instagram with profile-to-posts fan-out 3 social-network extractors that work entirely without auth, using public embed/preview endpoints + Instagram's own SEO-facing API: - linkedin_post: /embed/feed/update/{urn} returns full body, author, image, OG tags. Accepts both the urn:li:share and urn:li:activity URN forms plus the pretty /posts/{slug}-{id}-{suffix} URLs. - instagram_post: /p/{shortcode}/embed/captioned/ returns the full caption, username, thumbnail. Same endpoint serves reels and IGTV, kind correctly classified. - instagram_profile: /api/v1/users/web_profile_info/?username=X with the x-ig-app-id header (Instagram's public web-app id, sent by their own JS bundle). Returns the full profile + the 12 most recent posts with shortcodes, kinds, like/comment counts, thumbnails, and caption previews. Falls back to OG-tag scraping of the public HTML if the API ever 401/403s. The IG profile output is shaped so callers can fan out cleanly: for p in profile.recent_posts: scrape('instagram_post', p.url) giving you 'whole profile + every recent post' in one loop. End-to-end tested against ticketswave: 1 profile call + 12 post calls in ~3.5s. Pagination beyond 12 posts requires authenticated cookies and is left for the cloud where we can stash a session. Infrastructure change: added FetchClient::fetch_with_headers so extractors can satisfy site-specific request headers (here x-ig-app-id; later github_pr will use this for Authorization, etc.) without polluting the global FetchConfig.headers map. Same retry semantics as fetch(). Catalog now exposes 17 extractors via /v1/extractors. Total unit tests across the module: 47 passing. Clippy clean. Fmt clean. Live test on the maintainer's example URLs: - LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body / shipper.club link / CDN image extracted in 250ms. - Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave username, thumbnail. 200ms. 
- Instagram profile (ticketswave): 18,473 followers (exact, not rounded), is_verified=True, is_business=True, biography with emojis, 12 recent posts with shortcodes + kinds + likes. 400ms. Out of scope for this wave (require infra we don't have): - linkedin_profile: returns 999 to all bot UAs, needs OAuth - facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome - facebook_profile (personal): not publicly accessible by design	2026-04-22 14:39:49 +02:00
Valerio	b041f3cddd	feat(extractors): wave 2 — 8 more verticals (14 total) Adds 8 more vertical extractors using public JSON APIs. All hit deterministic endpoints with no antibot risk. Live tests pass against canonical URLs for each. AI / ML ecosystem (3): - crates_io → crates.io/api/v1/crates/{name} - huggingface_dataset → huggingface.co/api/datasets/{path} (handles both legacy /datasets/{name} and canonical {owner}/{name}) - arxiv → export.arxiv.org/api/query (Atom XML parsed by quick-xml) Code / version control (2): - github_pr → api.github.com/repos/{owner}/{repo}/pulls/{number} - github_release → api.github.com/repos/{owner}/{repo}/releases/tags/{tag} Infrastructure (1): - docker_hub → hub.docker.com/v2/repositories/{namespace}/{name} (official-image shorthand /_/nginx normalized to library/nginx) Community / publishing (2): - dev_to → dev.to/api/articles/{username}/{slug} - stackoverflow → api.stackexchange.com/2.3/questions/{id} + answers, filter=withbody for rendered HTML, sort=votes for consistent top-answers ordering Live test results (real URLs): - serde: 942M downloads, 838B response - 'Attention Is All You Need': abstract + authors, 1.8KB - nginx official: 12.9B pulls, 21k stars, 17KB - openai/gsm8k: 822k downloads, 1.7KB - rust-lang/rust#138000: merged by RalfJung, +3/-2, 1KB - webclaw v0.4.0: 2.4KB - a real dev.to article: 2.2KB body, 3.1KB total - python yield Q&A: score 13133, 51 answers, 104KB Catalog now exposes 14 extractors via GET /v1/extractors. Total unit tests across the module: 34 passing. Clippy clean. Fmt clean. Marketing positioning sharpens: 14 dedicated extractors, all deterministic, all 1-credit-per-call. Firecrawl's /extract is 5 credits per call and you write the schema yourself.	2026-04-22 14:20:21 +02:00
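The official-image shorthand normalization mentioned for `docker_hub` can be sketched as below. The function name is hypothetical, and handling of the `/r/{namespace}/{name}` page path is an assumption added for completeness; the commit only mentions the `/_/nginx` case.

```rust
/// Map a Docker Hub page path to the `(namespace, name)` pair used by the
/// `/v2/repositories/{namespace}/{name}` API.
/// Official images appear as `/_/nginx` and live under `library`.
fn normalize_docker_path(path: &str) -> Option<(String, String)> {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    match segments.as_slice() {
        // Official-image shorthand: /_/{name}
        ["_", name, ..] => Some(("library".to_string(), name.to_string())),
        // Assumed shape for user or organisation images: /r/{namespace}/{name}
        ["r", namespace, name, ..] => Some((namespace.to_string(), name.to_string())),
        _ => None,
    }
}
```

Normalizing before building the API URL keeps the extractor to a single request shape for both official and namespaced images.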
Valerio	8ba7538c37	feat(extractors): add vertical extractors module + first 6 verticals New extractors module returns site-specific typed JSON instead of generic markdown. Each extractor: - declares a URL pattern via matches() - fetches from the site's official JSON API where one exists - returns a typed serde_json::Value with documented field names - exposes an INFO struct that powers the /v1/extractors catalog First 6 verticals shipped, all hitting public JSON APIs (no HTML scraping, zero antibot risk): - reddit → www.reddit.com/*/.json - hackernews → hn.algolia.com/api/v1/items/{id} (full thread in one call) - github_repo → api.github.com/repos/{owner}/{repo} - pypi → pypi.org/pypi/{name}/json - npm → registry.npmjs.org/{name} + downloads/point/last-week - huggingface_model → huggingface.co/api/models/{owner}/{name} Server-side routes added: - POST /v1/scrape/{vertical} explicit per-vertical extraction - GET /v1/extractors catalog (name, label, description, url_patterns) The dispatcher validates that URL matches the requested vertical before running, so users get "URL doesn't match the X extractor" instead of opaque parse failures inside the extractor. 17 unit tests cover URL matching + path parsing for each vertical. Live tests against canonical URLs (rust-lang/rust, requests pypi, react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post) all return correct typed JSON in 100-300ms. Sample sizes: github 863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree). Marketing positioning: Firecrawl charges 5 credits per /extract call and you write the schema. Webclaw returns the same JSON in 1 credit per /scrape/{vertical} call with hand-written deterministic extractors per site.	2026-04-22 14:11:43 +02:00
Valerio	095ae5d4b1	polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23) Three P3 items from the 2026-04-16 audit. Bump to 0.3.17. webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice plus eq_ignore_ascii_case for the directive test. That was fragile: "Sitemap :" (space before colon) fell through silently, inline "# ..." comments leaked into the URL, and a line with no URL at all returned an empty string. Rewritten to split on the first colon, match any-case "sitemap" as the directive name, strip comments, and require `://` in the value. +7 unit tests cover case variants, space-before-colon, comments, empty values, non-URL values, and non-sitemap directives. webclaw-fetch/crawler.rs: is_cancelled uses Ordering::Acquire instead of Relaxed. Behaviourally equivalent on current hardware for single-word atomic loads, but the explicit ordering documents intent for readers + compilers. webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox FetchClient. Tool calls that repeatedly request the firefox profile without cookies used to build a fresh reqwest pool + TLS stack per call. Chrome (default) already used the long-lived field; Random is per-call by design; cookie-bearing requests still build ad-hoc since the cookie header is part of the client shape. Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core, 43 webclaw-llm, 11 CLI — all green. Clippy clean across workspace. Refs: docs/AUDIT-2026-04-16.md P3 section Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 20:21:32 +02:00
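The rewritten robots.txt rule described above (split on the first colon, match the directive name case-insensitively, strip comments, require `://`) might look roughly like this per line. The function name `sitemap_url` is hypothetical and this is not the code in webclaw-fetch/sitemap.rs. One simplification to note: the comment is stripped before the split, so a URL containing a `#` fragment would be truncated.

```rust
/// Hypothetical per-line helper: return the sitemap URL declared on one
/// robots.txt line, or None if the line is not a usable Sitemap directive.
fn sitemap_url(line: &str) -> Option<&str> {
    // Drop an inline "# ..." comment (everything from the first '#').
    let line = line.split('#').next().unwrap_or("");
    // Split on the FIRST colon only, so the "://" in the URL stays in the value.
    let (name, value) = line.split_once(':')?;
    // Trimming the name tolerates "Sitemap :" (space before the colon).
    if !name.trim().eq_ignore_ascii_case("sitemap") {
        return None;
    }
    let value = value.trim();
    // Reject empty values and values that are not absolute URLs.
    if value.contains("://") {
        Some(value)
    } else {
        None
    }
}
```

Under these assumptions, `Sitemap : https://example.com/sitemap.xml # main` yields the bare URL, while `Sitemap:`, `Sitemap: /relative.xml`, and `Disallow: /private` all yield `None`.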
Valerio	d69c50a31d	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22) * feat(fetch,llm): DoS hardening via response caps + glob validation (P2) Response body caps: - webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks Content-Length up front (before the allocation) and the actual .bytes() length after (belt-and-braces against lying upstreams). Previously the HTML -> markdown conversion downstream could allocate multiple String copies per page; a 100 MB page would OOM the process. - webclaw-llm providers (anthropic/openai/ollama) share a new response_json_capped helper with a 5 MB cap. Protects against a malicious or runaway provider response exhausting memory. Crawler frontier cap: after each BFS depth level the frontier is truncated to max(max_pages * 10, 100) entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after max_pages halted new fetches. Glob pattern validation: user-supplied include_patterns / exclude_patterns are rejected at Crawler::new if they contain more than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `**` against long paths. Cleanup: - Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs; no warnings surfaced, the suppression was obsolete. - core/.gitignore: replaced overbroad *.json with specific local-artifact patterns (previous rule would have swallowed package.json, components.json, .smithery/*.json). Tests: +4 validate_glob tests. Full workspace test: 283 passed (webclaw-core + webclaw-fetch + webclaw-llm). Version: 0.3.15 -> 0.3.16 CHANGELOG updated. Refs: docs/AUDIT-2026-04-16.md (P2 section) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: gitignore CLI research dumps, drop accidentally-tracked file research-*.json output from `webclaw ... --research ...` got silently swept into git by the relaxed *.json gitignore in the preceding commit. The old blanket *.json rule was hiding both this legitimate scratch file AND packages/create-webclaw/server.json (MCP registry config that we DO want tracked). Removes the research dump from git and adds a narrower research-*.json ignore pattern so future CLI output doesn't get re-tracked by accident. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:44:08 +02:00
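The glob limits and the frontier cap from the commit above can be sketched as two small functions. Both are illustrative: the signatures and error type are assumptions (only the name `validate_glob` appears in the commit message), the wildcard being counted is assumed to be `**`, and the frontier is assumed to be a vector with newly discovered links pushed at the end.

```rust
const MAX_GLOB_CHARS: usize = 1024;
const MAX_DOUBLE_STARS: usize = 4;

/// Reject patterns that could make a backtracking glob matcher blow up.
/// Signature and error type are assumptions for this sketch.
fn validate_glob(pattern: &str) -> Result<(), String> {
    if pattern.chars().count() > MAX_GLOB_CHARS {
        return Err(format!("pattern exceeds {} chars", MAX_GLOB_CHARS));
    }
    // Counts non-overlapping occurrences of "**".
    let stars = pattern.matches("**").count();
    if stars > MAX_DOUBLE_STARS {
        return Err(format!("pattern has {} `**` wildcards (max {})", stars, MAX_DOUBLE_STARS));
    }
    Ok(())
}

/// Truncate the crawl frontier to max(max_pages * 10, 100) entries,
/// keeping the most recently discovered links (assumed to be at the end).
fn truncate_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = max_pages.saturating_mul(10).max(100);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        frontier.drain(..excess);
    }
}
```

With `max_pages = 5` the cap is 100 (the floor), so a frontier of 150 links is cut to its last 100; with `max_pages = 50` the cap becomes 500.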
Valerio	7773c8af2a	fix(fetch): surface semaphore-closed as typed error instead of panic (P1) (#21) Three call sites in webclaw-fetch used .expect("semaphore closed") on `Semaphore::acquire()`. Under normal operation they never fire, but under a shutdown race or adversarial runtime state the spawned task would panic and be silently dropped from the batch / crawl run — the caller would see fewer results than URLs with no indication why. Rewritten to match on the acquire result: - client::fetch_batch and client::fetch_and_extract_batch_with_options now emit BatchResult/BatchExtractResult carrying FetchError::Build("semaphore closed before acquire"). - crawler's inner loop emits a failed PageResult with the same error string instead of panicking. Behaviorally a no-op for the happy path. Fixes the silent-dropped-task class of bug noted in the 2026-04-16 audit. Version: 0.3.14 -> 0.3.15 CHANGELOG updated. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:20:26 +02:00
Valerio	6316b1a6e7	fix: handle raw newlines in JSON-LD strings Sites like Bluesky emit JSON-LD with literal newline characters inside string values (technically invalid JSON). Add sanitize_json_newlines() fallback that escapes control characters inside quoted strings before retrying the parse. This recovers ProfilePage, Product, and other structured data that was previously silently dropped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 11:40:25 +02:00
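One way such a fallback could work is a single pass that tracks whether the cursor is inside a quoted string and escapes raw control characters only there. The sketch below borrows the name `sanitize_json_newlines` from the commit message, but the body is an assumption, not the project's implementation.

```rust
/// Sketch: escape raw control characters that appear inside JSON string
/// literals, leaving whitespace between tokens untouched.
fn sanitize_json_newlines(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_string = false;
    let mut escaped = false; // previous char was a backslash inside a string
    for c in input.chars() {
        if !in_string {
            if c == '"' {
                in_string = true;
            }
            out.push(c);
            continue;
        }
        if escaped {
            // Character following a backslash is copied through unchanged.
            out.push(c);
            escaped = false;
            continue;
        }
        match c {
            '\\' => {
                escaped = true;
                out.push(c);
            }
            '"' => {
                in_string = false;
                out.push(c);
            }
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            _ => out.push(c),
        }
    }
    out
}
```

The intended use is as a retry path: attempt a normal parse first, and only on failure run the sanitizer and parse again, so valid JSON-LD is never rewritten.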
Valerio	050b2ef463	feat: add allow_subdomains and allow_external_links to CrawlConfig Crawls are same-origin by default. Enable allow_subdomains to follow sibling/child subdomains (blog.example.com from example.com), or allow_external_links for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au). Includes 5 unit tests for root_domain(). Bump to 0.3.12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:33:06 +02:00
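A root-domain heuristic of the kind described above could be as small as the following. The function name `root_domain` comes from the commit message; the body, the suffix list (only `co.uk` and `com.au` are named in the commit), and the `same_site` helper are assumptions for illustration. A production crawler would more likely consult the Public Suffix List.

```rust
/// Sketch: return the registrable domain of a host, keeping three labels
/// when the host ends in a known two-part TLD and two labels otherwise.
fn root_domain(host: &str) -> String {
    // Illustrative list only; the real set of suffixes is not given in the source.
    const TWO_PART_TLDS: &[&str] = &["co.uk", "com.au"];
    let host = host.to_ascii_lowercase();
    let labels: Vec<&str> = host.split('.').collect();
    let n = labels.len();
    if n <= 2 {
        return host.clone();
    }
    let last_two = format!("{}.{}", labels[n - 2], labels[n - 1]);
    let keep = if TWO_PART_TLDS.contains(&last_two.as_str()) { 3 } else { 2 };
    labels[n - keep..].join(".")
}

/// Hypothetical check for an allow_subdomains-style rule:
/// two hosts are siblings when they share a root domain.
fn same_site(a: &str, b: &str) -> bool {
    root_domain(a) == root_domain(b)
}
```

Here `blog.example.com` and `example.com` share the root `example.com`, while `shop.example.co.uk` reduces to `example.co.uk` rather than `co.uk`.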
Valerio	a4c351d5ae	feat: add fallback sitemap paths for broader discovery Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms use non-standard paths that were previously missed. Paths found via robots.txt are deduplicated to avoid double-fetching. Bump to 0.3.11. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:22:57 +02:00
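The candidate list with deduplication might be assembled along these lines. The function name `sitemap_candidates`, its signature, and the choice to try robots.txt entries before the fallback paths are assumptions; the commit only states which paths are tried and that duplicates are skipped.

```rust
use std::collections::HashSet;

/// Sketch: build the ordered list of sitemap URLs to try for an origin,
/// skipping any URL that was already listed in robots.txt.
fn sitemap_candidates(origin: &str, from_robots: &[&str]) -> Vec<String> {
    const PATHS: [&str; 4] = [
        "/sitemap.xml",
        "/sitemap_index.xml",
        "/wp-sitemap.xml",
        "/sitemap/sitemap-index.xml",
    ];
    let origin = origin.trim_end_matches('/');
    let robots = from_robots.iter().map(|u| u.to_string());
    let fallbacks = PATHS.iter().map(|p| format!("{}{}", origin, p));

    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for url in robots.chain(fallbacks) {
        // HashSet::insert returns false for a URL we have already queued.
        if seen.insert(url.clone()) {
            out.push(url);
        }
    }
    out
}
```

If robots.txt already lists `https://example.com/sitemap.xml`, that URL appears once in the result and the three remaining fallback paths follow it.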
Valerio	25b6282d5f	style: fix rustfmt for 2-element delay array Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 17:21:53 +02:00
Valerio	954aabe3e8	perf: reduce fetch timeout to 12s and retries to 2 Stress testing showed 33% of proxies are dead, causing 30s+ timeouts per request with 3 retries (worst case 94s). Reducing timeout from 30s to 12s and retries from 3 to 2 brings worst case to 25s. Combined with disabling 509 dead proxies from the pool, this should significantly improve response times under load. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 17:18:57 +02:00
Valerio	124352e0b4	style: cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:25:40 +02:00
Valerio	aaf51eddef	feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles. This removes all 5 [patch.crates-io] entries that consumers previously needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145) are now built directly on wreq's Emulation API with correct TLS options, HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order. 84% pass rate across 1000 real sites. 384 unit tests green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:04:55 +02:00
Valerio	44f23332cc	style: collapse nested if per clippy Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:13:55 +02:00
Valerio	20c810b8d2	chore: bump v0.3.1, update CHANGELOG, fix fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:11:54 +02:00
Valerio	7041a1d992	feat: cookie warmup fallback for Akamai-protected pages When a fetch returns a challenge page (small HTML with Akamai markers), automatically visit the homepage first to collect _abck/bm_sz cookies, then retry the original URL. This bypasses Akamai's cookie-based gate on subpages without needing JS execution. Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr sensor marker on responses under 15KB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:09:31 +02:00
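The detection rule in the commit above (a small response carrying one of two markers) reduces to a short predicate. The function name and the reading of "15KB" as 15 * 1024 bytes are assumptions; the two marker strings are taken from the commit message.

```rust
/// Assumed interpretation of "under 15KB".
const MAX_CHALLENGE_BYTES: usize = 15 * 1024;

/// Sketch: decide whether a response body looks like an Akamai challenge
/// page, in which case the caller would warm up cookies on the homepage
/// and retry the original URL.
fn looks_like_akamai_challenge(body: &str) -> bool {
    if body.len() >= MAX_CHALLENGE_BYTES {
        // Real content pages are usually larger; skip the marker scan.
        return false;
    }
    body.contains("<title>Challenge Page</title>") || body.contains("bazadebezolkohpepadr")
}
```

The size gate keeps a full article that merely mentions one of the markers from triggering a needless warmup request.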
Valerio	4cba36337b	style: fix fmt in client.rs test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:18:57 +02:00
Valerio	199dab6dfa	fix: adapt to webclaw-tls v0.1.1 HeaderMap API change Response.headers() now returns &http::HeaderMap instead of &HashMap<String, String>. Updated FetchResult, is_pdf_content_type, is_document_content_type, is_bot_protected, and all related tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:09:50 +02:00
Valerio	e3b0d0bd74	fix: make reddit and linkedin modules public for server access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:54:35 +02:00
Valerio	f275a93bec	fix: clippy empty-line-after-doc-comment in browser.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:45:05 +02:00
Valerio	140234c139	style: cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:43:11 +02:00
Valerio	f13cb83c73	feat: replace primp with webclaw-tls, bump to v0.3.0 Replace primp dependency with our own TLS fingerprinting stack (webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match. - Remove primp entirely (zero references remaining) - webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls - Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains) - Skip unknown certificate extensions (SCT tolerance) - 99% bypass rate on 102 sites (was ~85% with primp) - Fixes #5 (HTTPS broken — example.com and similar sites now work) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:40:10 +02:00
Valerio	ea14848772	feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM Document extraction: - DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml) - XLSX/XLS: markdown tables with multi-sheet support (via calamine) - CSV: quoted field handling, markdown table output - All auto-detected by Content-Type header or URL extension New features: - -f html output format (sanitized HTML) - Multi-URL watch: --urls-file + --watch monitors all URLs in parallel - Batch + LLM: --extract-prompt/--extract-json works with multiple URLs - Mixed batch: HTML pages + DOCX + XLSX + CSV in one command Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:28:23 +01:00
Valerio	0e4128782a	fix: v0.1.7 — extraction options now work in batch mode (#3) --only-main-content, --include, and --exclude were ignored in batch mode because run_batch used default ExtractionOptions. Added fetch_and_extract_batch_with_options to pass CLI options through. Closes #3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 13:30:20 +01:00
Valerio	0c91c6d5a9	feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support Crawl: - Real-time progress on stderr as pages complete - --crawl-state saves progress on Ctrl+C, resumes from saved state - Visited set + remaining frontier persisted for accurate resume MCP server: - Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars - Falls back to proxies.txt in CWD (existing behavior) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 21:38:28 +01:00
Valerio	afe4d3077d	feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra - Switch default profile to Safari26/Mac (best CF pass rate) - Auto-fallback to plain client on connection error or 403 - Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites - Reddit .json endpoint uses plain client (TLS fingerprint was blocked) - YouTube caption track extraction + timed text parser (core, not yet wired) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 18:50:07 +01:00
Valerio	907966a983	fix: use plain client for Reddit JSON endpoint Reddit blocks TLS-fingerprinted clients on their .json API but accepts standard requests with a browser User-Agent. Switch to a non-impersonated primp client for the Reddit fallback path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 18:43:47 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00

50 commits