Rescued from the stale perf/audit-fixes branch — the *perf-only* subset of
that branch's big mixed commit, ported cleanly onto current main with
byte-identical extraction output.
- markdown: hoist the `img[alt]` / `a[href]` selectors out of the per-node
noise path into `Lazy` statics (stop recompiling them per element).
- extractors: single shared `og()` / `parse_og()` module replaces the
per-field Open Graph re-scan duplicated across 7 vertical extractors
(amazon, ebay, ecommerce, etsy, substack, trustpilot, youtube). Each
vertical now does one pass. Raw-vs-unescaped behaviour preserved exactly.
- core: gate the QuickJS VM on a cheap marker check (skip it entirely when
the page has no JS-assigned data) and reuse the already-parsed document
instead of re-parsing the HTML.
- fetch: connection-pool tuning on the wreq client (connect_timeout, idle
pool, max-idle-per-host, tcp keepalive) for connection reuse.
Output-equivalence is covered by existing tests (amazon quot-entity,
trustpilot title parse, ecommerce/youtube/etsy/substack og fallbacks) — all
green. No new dependencies; no public API change.
Deliberately EXCLUDED from this slice (separate concerns bundled in the
original commit): the `#[non_exhaustive]` API-breaking changes, the LLM/PDF/
server reliability hardening (much already shipped in 0.6.8), the tooling
(cargo-deny, release profile, MSRV), and the retry-loop dedup refactor (a
code-cleanup with no runtime benefit — not worth churning client.rs for).
Original work by the prior author on perf/audit-fixes; this re-applies only
the performance subset onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to #58/#59, which fixed numeric params but left the booleans.
MCP clients (e.g. Claude Desktop) send `true` as the JSON string `"true"`,
which serde's default bool deserializer rejects with
`invalid type: string "true", expected a boolean`, failing the call.
Adds a `deser_opt_bool_or_str` helper (same untagged pattern as the #59
numeric helpers) that accepts a JSON boolean OR "true"/"false"
(case-insensitive, trimmed) and rejects anything else with a clear error.
Numeric-looking strings like "1" are intentionally NOT coerced to bool.
Applied to every Option<bool> tool param:
- scrape -> only_main_content
- crawl -> use_sitemap
- research -> deep
- search -> scrape (added by the standalone-search slice, #63)
16 unit tests (bool / "true"-string / absent->None / garbage->error per
field). No new dependencies.
Fixes#62.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main (fetch + CLI only — the original commit never touched the
server/MCP map surfaces).
`--map` used to return only what a site advertises in sitemap.xml, which
is nothing for sites with no sitemap (e.g. Hacker News) or a thin one.
Now discovery is layered:
- webclaw-fetch::discover_urls() / MapOptions — sitemaps first
(authoritative, carries lastmod/priority/changefreq); when the sitemap
is thin (< min_sitemap_urls) and the fallback is enabled, run a bounded
same-origin crawl and harvest links from every fetched page plus the
unfetched frontier, deduped against the sitemap set.
- sitemap.rs: gzip (.xml.gz) support via a new decode_sitemap_body() +
FetchClient::fetch_raw() (raw bytes, no lossy UTF-8); deeper index
recursion (3->5); 4 more fallback paths.
- CLI: --map-pages / --no-map-crawl / --map-limit; crawler logs now go to
stderr so `--map -f json` stays machine-parseable.
One new dependency: flate2 (already resolved in the lockfile transitively).
Includes the commit's unit tests (map dedup/origin, gzip decode). Original
work by the prior author on perf/audit-fixes; this re-applies only the map
slice onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rescued from the stale perf/audit-fixes branch and ported cleanly onto
current main. OSS surfaces can now search without the hosted webclaw API
when the caller supplies their own Serper.dev key (free at serper.dev).
- webclaw-fetch::search() — calls Serper.dev directly (plain wreq client;
a JSON API needs no fingerprinting) and, with scrape=true, fetches +
extracts the top result pages concurrently (bounded) via the caller's
FetchClient. parse_serper_organic() is pure and unit-tested.
- MCP `search` tool: local-first — uses SERPER_API_KEY when set, else
falls back to the hosted webclaw API. Adds country/lang/scrape params.
- OSS REST server: POST /v1/search, gated on SERPER_API_KEY (501 when
unset, with a setup hint). Adds ApiError::NotImplemented.
- CLI: `webclaw search <query> [--serper-key|SERPER_API_KEY] [--num]
[--country] [--lang] [--scrape] [--format]`.
No new dependencies (reuses futures-util already in the tree). Original
work by the prior author on perf/audit-fixes; this re-applies only the
search slice onto main.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a Google Gemini provider (Generative Language API) to the chain, ordered Ollama -> OpenAI -> Gemini -> Anthropic so Google credits are preferred with Anthropic as last-resort fallback. System->systemInstruction, assistant->model, json_mode->responseMimeType; model name validated before URL interpolation; maxOutputTokens defaults high for 2.5 thinking models. Also fixes AnthropicProvider default (retired claude-sonnet-4-20250514 -> 404); now claude-sonnet-4-6, honors ANTHROPIC_MODEL.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reformat the string-or-number deserialize helpers and tests to satisfy
`cargo fmt --check` (style_edition 2024), which the lint CI job enforces.
Formatting only — no behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MCP clients (Claude Desktop, VS Code Copilot, etc.) serialize numeric
tool arguments as JSON strings ("3" instead of 3). serde's built-in
u32/usize deserialisers reject these with:
invalid type: string "N", expected u32
Add two private coercion helpers — `deser_opt_u32_or_str` and
`deser_opt_usize_or_str` — that accept both JSON number and JSON string
representations, falling back to `str::parse` for the string form and
returning a clear custom error for non-numeric strings.
Annotate the six affected optional fields:
CrawlParams: depth (u32), max_pages (usize), concurrency (usize)
BatchParams: concurrency (usize)
SearchParams: num_results (u32)
SummarizeParams: max_sentences (usize)
Add 24 unit tests (4 per field: numeric string → value, native number
→ value, absent → None, non-numeric string → Err) verified green via
an isolated serde-only crate.
Fixes#58
- webclaw-llm: add explicit request + connect timeouts to the reqwest
client in every provider (anthropic, openai, ollama) with a shorter
timeout on the ollama health check, so a stalled provider fails fast.
- webclaw-llm: fix a panic when truncating a provider error body that
contains multibyte characters near the 500-char cut (char-safe take).
- webclaw-core: snap the endpoint-scan budget cut to a UTF-8 char
boundary so oversized scripts with non-ASCII content no longer panic.
- webclaw-core: rewrite js_literal_to_json to copy raw bytes instead of
`byte as char`, preserving multibyte UTF-8 in SvelteKit string values
rather than producing Latin-1 mojibake.
- webclaw-cli: have fire_webhook return its JoinHandle and await it at
the crawl/batch/batch-llm call sites, removing the fixed 500ms sleeps.
- webclaw-mcp: drop the up-front DNS pre-validation loop in batch that
aborted the whole request on one bad URL; the fetch layer already
applies the same SSRF guard per URL and reports per-URL errors.
- webclaw-fetch: include the port in the warmup homepage URL so hosts
on a non-default port are warmed correctly.
Adds regression tests for the UTF-8 endpoint-scan and SvelteKit cases.
Ports the TLS/Response API breaks in the bump:
- certificate_compression_algorithms -> certificate_compressors with
wreq-util's BrotliCompressor/ZlibCompressor trait objects
- ExtensionType::APPLICATION_SETTINGS_NEW -> APPLICATION_SETTINGS (same
codepoint 17613)
- wreq_util::Emulation::SafariIos26.emulation() ->
Profile::SafariIos26.into_emulation(); Emulation fields are now public
so *_mut() accessors become direct field access; build() takes a Group
- Response::chunk() removed -> bytes_stream() (wreq 'stream' feature) with
the running body-size ceiling preserved; adds futures-util
Browser fingerprints verified unchanged on tls.peet.ws: Chrome JA3
43067709b025da334de1279a120f8e14, Safari iOS JA3 8d909525bd5bbb79f133d11cc05159fe.
When bash splits a URL at & or ? (a common foot-gun), webclaw
receives only the truncated prefix and silently fetches the wrong
page. Per issue #6:
1. Heuristic warning: if the URL ends with '&' or contains '?' with
no '=' after, emit a stderr warning before fetching:
# webclaw: warning: URL looks truncated (ends with '&' or '?'); did the shell split it? Quote the URL or use --url-encoded.
2. New flag --url-encoded: parallel input that asserts the user has
handled escaping. Suppresses the truncation warning since intent
is explicit.
Fetch proceeds in both cases; this is informational only. 4 new
tests in webclaw-cli. Workspace 720 -> 724.
(cherry picked from commit 4ef27fcd33)
Webclaw's default -t timeout is 30s; slow sites previously sat
silently with no feedback. Now during a fetch, every 10s of elapsed
time webclaw writes one line to stderr:
# webclaw: still fetching <URL> (Ns)
Fetches completing in under 10s emit nothing (the timer never fires).
Stdout output is untouched - pure feedback signal on stderr.
No timeout change. No new flags. Default behavior is augmented at
stderr only.
Implemented via tokio::select! between the fetch future and a
tokio::time::interval. Latency cost: a single tokio task spawn
and a 10s tick - microseconds on the fast path.
10 new tests in webclaw-fetch::progress::tests (none ignored; the
slow-future test uses a 50ms test interval to keep cargo test fast).
Workspace total 710 -> 720.
(cherry picked from commit 06f065cb08)
wreq is a release candidate with no API stability between rc.N builds
(rc.29 broke the TLS + Response API). `cargo install` and the release
workflow both ignore Cargo.lock and were re-resolving to rc.29, breaking
the build. An exact `=6.0.0-rc.28` / `=3.0.0-rc.10` pin keeps every build
path deterministic until wreq reaches a stable release.
Reddit blocked unauthenticated `.json` access, so the previous extractor
returned block pages or timed out on every thread. Switch to parsing
old.reddit.com's server-rendered HTML, which needs no API key or JS.
Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all
`.json` URL handling and the JSON response parser.
Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment);
old.reddit omits a usable depth attribute, so the tree is walked
recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count
(data-comments-count), self-vs-link (self class / self.* domain),
flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not
the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned;
"load more comments" stubs are skipped.
Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning
text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links
resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather
than falling through to generic extraction of Reddit chrome.
Tests: regression suite runs against real old.reddit.com fixtures
(testdata/reddit/), the ground truth that surfaced the parsing and
markdown bugs synthetic HTML had hidden. Fixtures are excluded from the
published crate.
Security audit follow-up across the workspace:
- webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a
cfg(not(wasm32)) target dependency and the extraction entry point uses
a direct call on wasm instead of spawning a thread, so it builds and
runs on wasm32 with or without default features.
- webclaw-core: bound the structured-data scrubber recursion (depth cap)
so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the
stack.
- webclaw-fetch: stream the response body with a running ceiling so a
small highly compressed payload cannot inflate to gigabytes in memory;
redact user:pass@ from proxy URLs before they reach error strings.
- webclaw-cli: contain output filenames inside the chosen directory
(reject .. / absolute, drop traversal path segments), run --webhook
URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s,
and make research slug truncation char-safe.
- webclaw-mcp: char-safe slug truncation (no multibyte slice panic).
- setup.sh / deploy/hetzner.sh: replace eval on read input with
printf -v, and mask auth key / API token in console output.
- CI: enforce the wasm32 build invariant for webclaw-core.
Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.
Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions.\n\nIncludes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default.\n\nThanks to @devnen for the report and patch work.
Improve LLM-format output for modern news and documentation pages.
- Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records
- Fix element/text spacing without detaching punctuation on docs, forums, and reference pages
- Remove common accessibility link chrome from LLM text and link labels
- Bump workspace version to 0.6.0 and update the changelog
Thanks to Nenad Oric (@devnen) for the original PR and contribution.
- New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26.
Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS
extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers,
gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3
8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on
immobiliare.it with country-it residential.
- BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped
MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256,
explicit extension_permutation, advertise h3 in ALPN and ALPS.
JA3 43067709b025da334de1279a120f8e14, akamai_fp
52d84b11737d980aef856699f885ca86. Fixes indeed.com and other
Cloudflare-fronted sites.
- New locale module: accept_language_for_url / accept_language_for_tld.
TLD to Accept-Language mapping, unknown TLDs default to en-US.
DataDome geo-vs-locale cross-checks are now trivially satisfiable.
- wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.
Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a
403 even from residential IPs. Their block list includes known
browser-emulation library fingerprints. wreq-Firefox passes. The
CLI `vertical` subcommand already forced Firefox; MCP
`vertical_scrape` was still falling back to the long-lived
`self.fetch_client` which defaults to Chrome, so reddit failed
on MCP and nobody noticed because the earlier test runs all had
an API key set that masked the issue.
Switched vertical_scrape to reuse `self.firefox_or_build()` which
gives us the cached Firefox client (same pattern the scrape tool
uses when the caller requests `browser: firefox`). Firefox is
strictly-safer-than-Chrome for every vertical in the catalog, so
making it the hard default for `vertical_scrape` is the right call.
Verified end-to-end from a clean shell with no WEBCLAW_API_KEY:
- MCP reddit: 679ms, post/author/6 comments correct
- MCP instagram_profile: 1157ms, 18471 followers
No change to the `scrape` tool -- it keeps the user-selectable
browser param.
Bumps version to 0.5.3.
Wires the vertical extractor catalog into both the CLI and the MCP
server so users don't have to hit the HTTP API to invoke them. Same
semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.
CLI (webclaw-cli):
- New subcommand `webclaw extractors` lists all 28 extractors with
name, label, and sample URL. `--json` flag emits the full catalog
as machine-readable JSON.
- New subcommand `webclaw vertical <name> <url>` runs a specific
extractor and prints typed JSON. Pretty-printed by default; `--raw`
for single-line. Exits 1 with a clear "URL does not match" error
on mismatch.
- FetchClient built with Firefox profile + cloud fallback attached
when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate.
MCP (webclaw-mcp):
- New tool `list_extractors` (no args) returns the catalog as
pretty-printed JSON for in-session discovery.
- New tool `vertical_scrape` takes `{name, url}` and returns typed
JSON. Reuses the long-lived self.fetch_client.
- Tool count goes from 10 to 12. Server-info instruction string
updated accordingly.
Tests: 215 passing, clippy clean. Manual surface-tested end-to-end:
CLI prints real Reddit/github/pypi data; MCP JSON-RPC session returns
28-entry catalog + typed responses for pypi/requests + rust-lang/rust
in 200-400ms.
Version bumped to 0.5.2 (minor for API additions, backwards compatible).
Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now
take `client: &dyn Fetcher` instead of `client: &FetchClient` directly.
Backwards-compatible: FetchClient implements Fetcher, blanket impls
cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server
callers keep working unchanged.
Motivation: the production API server (api.webclaw.io) must not do
in-process TLS fingerprinting; it delegates all HTTP to the Go
tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on
production would have required importing wreq into the server's
dep graph, violating the CLAUDE.md rule. Now production can provide
its own TlsSidecarFetcher implementation and pass it to the same
dispatcher the OSS server uses.
Changes:
- New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus
blanket impls for `&T` and `Arc<T>`.
- `FetchClient` gains a tiny impl block in client.rs that forwards to
its existing public methods.
- All 28 extractor signatures migrated from `&FetchClient` to
`&dyn Fetcher` (sed-driven bulk rewrite, no semantic change).
- `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`.
- `extractors::dispatch_by_url` and `extractors::dispatch_by_name`
take `&dyn Fetcher`.
- `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has
native async-fn-in-trait but dyn dispatch still needs async_trait).
- Version bumped to 0.5.1, CHANGELOG updated.
Tests: 215 passing in webclaw-fetch (no new tests needed — the existing
extractor tests exercise the trait methods transparently).
Clippy: clean workspace-wide.
Addresses the four follow-ups surfaced by the cloud-key smoke test.
trustpilot_reviews — full rewrite for 2025 schema:
- Trustpilot moved from single-Organization+aggregateRating to three
separate JSON-LD blocks: a site-level Organization (Trustpilot
itself), a Dataset with a csvw:Table mainEntity carrying the
per-star distribution for the target business, and an aiSummary +
aiSummaryReviews block with the AI-generated summary and recent
review objects.
- Parser now: skips the site-level Org, walks @graph as either array
or single object, picks the Dataset whose about.@id references the
target domain, parses each csvw:column for rating buckets, computes
weighted-average rating + total from the distribution, extracts the
aiSummary text, and turns aiSummaryReviews into a clean reviews
array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
average_rating when the Dataset block is absent. OG-description
regex for review_count.
- Returned shape: url, domain, business_name, rating_label,
average_rating, review_count, rating_distribution (per-star count
and percent), ai_summary, recent_reviews, review_count_listed,
data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 /
226 reviews with full distribution + AI summary + 2 recent reviews.
amazon_product — force-cloud-escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA
pages. When local fetch returns HTML without Product JSON-LD and
a cloud client is configured, force-escalate to the cloud path
which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the
cloud's synthesize_html output (OG tags only, no #productTitle DOM
ID) still yields useful data. Real Amazon pages still prefer the
DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple
MacBook Pro title + description + asin.
etsy_listing — slug humanization + generic-page filtering + shop
from brand:
- Etsy serves various placeholder pages when a listing is delisted,
blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
"This item is unavailable - Etsy", plus the OG description
"Sorry, the page you were looking for was not found." is_generic_*
helpers catch all three shapes.
- When the OG title is generic, humanise the URL slug: the path
`/listing/123456789/personalized-stainless-steel-tumbler` becomes
`Personalized Stainless Steel Tumbler` so callers always get a
meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on
listings that don't ship offers[].seller.name. Shop now falls
through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data
(title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating /
225 reviews, InStock).
cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does
not return raw HTML by design and that [`synthesize_html`]
reassembles the parsed response (metadata + structured_data +
markdown) back into minimal HTML so existing local parsers run
unchanged across both paths. Also notes the DOM-regex limitation
for extractors that need live-page-specific DOM IDs.
Tests: 215 passing in webclaw-fetch (18 new), clippy clean.
Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY:
28/28 clean, 0 partial, 0 failed.
api.webclaw.io/v1/scrape does not return a `html` field even when
`formats=["html"]` is requested, by design: the cloud API returns
pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags,
title, description, image, site_name), and `markdown`.
Our CloudClient::fetch_html helper was premised on the API returning
raw HTML. Without a key set, the error message was hidden behind
CloudError::NotConfigured so the bug never surfaced. With a key set,
every extractor that escalated to cloud (trustpilot_reviews,
etsy_listing, amazon_product, ebay_listing, substack_post HTML
fallback) got back "cloud /v1/scrape returned no html field".
Fix: reassemble a minimal synthetic HTML document from the cloud's
parsed output. Each JSON-LD block goes back into a
`<script type="application/ld+json">` tag, metadata fields become OG
`<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing
local extractor parsers (find_product_jsonld, find_business,
og() regex) see the same shapes they'd see from a real page, so no
per-extractor changes needed.
Verified end-to-end with WEBCLAW_CLOUD_API_KEY set:
- trustpilot_reviews: escalates, returns Organization JSON-LD data
(parser picks Trustpilot site-level Org not the reviewed business;
tracked as a follow-up to update Trustpilot schema handling)
- etsy_listing: escalates via antibot render path; listing-specific
data depends on target listing having JSON-LD (many Etsy listings
don't)
- amazon_product, ebay_listing: stay local because their pages ship
enough content not to trigger bot-detection escalation
- The other 24 extractors unchanged (local path, zero cloud credits)
Tests: 200 passing in webclaw-fetch (3 new), clippy clean.
Two targeted fixes surfaced by the manual extractor smoke test.
cloud::is_bot_protected:
- Trustpilot serves a ~565-byte AWS WAF interstitial with the string
"Verifying your connection..." and an `interstitial-spinner` div.
That pattern was not in our detector, so local fetch returned the
challenge page, JSON-LD parsing found nothing, and the extractor
emitted a confusing "no Organization/LocalBusiness JSON-LD" error.
- Added the pattern plus a <10KB size gate so real articles that
happen to mention the phrase aren't misclassified. Two new tests
cover positive + negative cases.
- With the fix, trustpilot_reviews now correctly escalates via
smart_fetch_html and returns the clean "Set WEBCLAW_API_KEY"
actionable error without a key, or cloud-bypassed HTML with one.
ecommerce_product:
- Previously hard-failed when a page had no Product JSON-LD, and
produced an empty `offers` list when JSON-LD was present but its
`offers` node was. Many sites (Patagonia-style catalog pages,
smaller Squarespace stores) ship one or the other of OG / JSON-LD
but not both with price data.
- Added OG meta-tag fallback that handles:
* no JSON-LD at all -> build minimal payload from og:title,
og:image, og:description, product:price:amount,
product:price:currency, product:availability, product:brand
* JSON-LD present but offers empty -> augment with an OG-derived
offer so price comes through
- New `data_source` field: "jsonld", "jsonld+og", or "og_fallback"
so callers can tell which branch populated the data.
- `has_og_product_signal()` requires og:type=product or a price tag
so blog posts don't get mis-classified as products.
Tests: 197 passing in webclaw-fetch (6 new), clippy clean.
Adds etsy_listing and hardens two existing extractors with HTML fallbacks
so transient API failures still return useful data.
New:
- etsy_listing: /listing/{id}(/slug) with Schema.org Product JSON-LD +
OG fallback. Antibot-gated, routes through cloud::smart_fetch_html
like amazon_product and ebay_listing. Auto-dispatched (etsy host is
unique).
Hardened:
- substack_post: when /api/v1/posts/{slug} returns non-200 (rate limit,
403 on hardened custom domains, 5xx), fall back to HTML fetch and
parse OG tags + Article JSON-LD. Response shape is stable across
both paths, with a `data_source` field of "api" or "html_fallback".
- youtube_video: when ytInitialPlayerResponse is missing (EU-consent
interstitial, age-gated, some live pre-shows), fall back to OG tags
for title/description/thumbnail. `data_source` now "player_response"
or "og_fallback".
Tests: 91 passing in webclaw-fetch (9 new), clippy clean.
Three hard-site extractors that all require antibot bypass to ever
return usable data. They ship in OSS so the parsers + schema live
with the rest of the vertical extractors, but the fetch path routes
through cloud::smart_fetch_html \u2014 meaning:
- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or
WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on
challenge-page detection we escalate to api.webclaw.io/v1/scrape
with formats=['html'] and parse the antibot-bypassed HTML locally.
- Without a cloud key, callers get a typed CloudError::NotConfigured
whose Display message points at https://webclaw.io/signup.
Self-hosters without a webclaw.io account know exactly what to do.
## New extractors (all auto-dispatched \u2014 unique hosts)
- amazon_product: ASIN extraction from /dp/, /gp/product/,
/product/, /exec/obidos/ASIN/ URL shapes across every amazon.*
locale. Parses the Product JSON-LD Amazon ships for SEO; falls
back to #productTitle and #landingImage DOM selectors when
JSON-LD is absent. Returns price, currency, availability,
condition, brand, image, aggregate rating, SKU / MPN.
- ebay_listing: item-id extraction from /itm/{id} and
/itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr /
.it. Parses both bare Offer (Buy It Now) and AggregateOffer
(used-copies / auctions) from the Product JSON-LD. Returns
price or low/high-price range, currency, condition, seller,
offer_count, aggregate rating.
- trustpilot_reviews: reactivated from the `trustpilot_reviews`
file that was previously dead-code'd. Parser already worked; it
just needed the smart_fetch_html path to get past AWS WAF's
'Verifying Connection' interstitial. Organisation / LocalBusiness
JSON-LD block gives aggregate rating + up to 20 recent reviews.
## FetchClient change
- Added optional `cloud: Option<Arc<CloudClient>>` field with
`FetchClient::with_cloud(cloud) -> Self` builder + `cloud(&self)`
accessor. Extractors call client.cloud() to decide whether they
can escalate. Cheap clones (Arc-wrapped).
## webclaw-server wiring
AppState::new() now reads the cloud credential from env:
1. WEBCLAW_CLOUD_API_KEY \u2014 preferred, disambiguates from the
server's own inbound bearer token.
2. WEBCLAW_API_KEY \u2014 fallback only when the server is in open
mode (no inbound-auth key set), matching the MCP / CLI
convention of that env var.
When present, state.rs builds a CloudClient and attaches it to the
FetchClient via with_cloud(). Log line at startup so operators see
when cloud fallback is active.
## Catalog + dispatch
All three extractors registered in list() and in dispatch_by_url.
/v1/extractors catalog now exposes 22 verticals. Explicit
/v1/scrape/{vertical} routes work per the existing pattern.
## Tests
- 7 new unit tests (parse_asin multi-shape + parse from JSON-LD
fixture + DOM-fallback on missing JSON-LD for Amazon; ebay
URL-matching + slugged-URL parsing + both Offer and AggregateOffer
fixtures).
- Full extractors suite: 68 passing (was 59, +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI \u2014 verifying them
means having WEBCLAW_CLOUD_API_KEY set against a real cloud
backend. Integration-test harness is a separate follow-up.
Catalog summary: 22 verticals total across wave 1-5. Hard-site
three are gated behind an actionable cloud-fallback upgrade path
rather than silently returning nothing or 403-ing the caller.
The local-first / cloud-fallback flow was duplicated in two places:
- webclaw-mcp/src/cloud.rs (302 lines, canonical)
- webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid
pulling rmcp as a dep)
Move to the shared crate where all vertical extractors and the new
webclaw-server can also reach it.
## New module: webclaw-fetch/src/cloud.rs
Single canonical home. Consolidates both previous versions and
promotes the error type from stringy to typed:
- `CloudError` enum with dedicated variants for the four HTTP
outcomes callers act on differently — 401 (key rejected),
402 (insufficient plan), 429 (rate limited), plus ServerError /
Network / ParseFailed. Each variant's Display message ends with
an actionable URL (signup / pricing / dashboard) so API consumers
can surface it verbatim.
- `From<CloudError> for String` bridge so the dozen existing
`.await?` call sites in MCP / CLI that expected `Result<_, String>`
keep compiling. We can migrate them to the typed error per-site
later without a churn commit.
- `CloudClient::new(Option<&str>)` matches the CLI's `--api-key`
flag pattern (explicit key wins, env fallback, None when empty).
`::from_env()` kept for MCP-style call sites.
- `with_key_and_base` for staging / integration tests.
- `scrape / post / get / fetch_html` — `fetch_html` is new, a
convenience that calls /v1/scrape with formats=["html"] and
returns the raw HTML string so vertical extractors can plug
antibot-bypassed HTML straight into their parsers.
- `is_bot_protected` + `needs_js_rendering` detectors moved
over verbatim. Detection patterns are public (CF / DataDome /
AWS WAF challenge-page signatures) — no moat leak.
- `smart_fetch` kept on the original `Result<_, String>`
signature so MCP's six call sites compile unchanged.
- `smart_fetch_html` is new: the local-first-then-cloud flow
for the vertical-extractor pattern, returning the typed
`CloudError` so extractors can emit precise upgrade-path
messages.
## Cleanup
- Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to
`webclaw_fetch:☁️:*`. Dropped reqwest as a direct dep of
webclaw-mcp (it only used it for the old cloud client).
- Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its
webhook / on-change / research HTTP calls.
- webclaw-fetch now has reqwest as a direct dep. It was already
transitively pulled in by webclaw-llm; this just makes the
dependency relationship explicit at the call site.
## Tests
16 new unit tests cover:
- CloudError status mapping (401/402/429/5xx)
- NotConfigured error includes signup URL
- CloudClient::new explicit-key-wins-over-env + empty-string = None
- base_url strips trailing slash
- Detector matrix (CF challenge / Turnstile / real content with
embedded Turnstile / SPA skeleton / real article with script tags)
- truncate respects char boundaries (don't slice inside UTF-8)
Full workspace test suite still passes (~500 tests). fmt + clippy
clean. No behavior change for existing MCP / CLI call sites.
Two ecommerce extractors covering the long tail of online stores:
- shopify_product: hits the public /products/{handle}.json endpoint
that every Shopify store exposes. Undocumented but stable for 10+
years. Returns title, vendor, product_type, tags, full variants
array (price, SKU, stock, options), images, options matrix, and
the price_min/price_max/any_available summary fields. Covers the
~4M Shopify stores out there, modulo stores that put Cloudflare
in front of the shop. Rejects known non-Shopify hosts (amazon,
etsy, walmart, etc.) to save a failed request.
- ecommerce_product: generic Schema.org Product JSON-LD extractor.
Works on any modern store that ships the Google-required Product
rich-result markup: Shopify, WooCommerce, BigCommerce, Squarespace,
Magento, custom storefronts. Returns name, brand, SKU, GTIN, MPN,
images, normalized offers (Offer and AggregateOffer flattened into
one shape with price, currency, availability, condition),
aggregateRating, and the raw JSON-LD block for anyone who wants it.
Reuses webclaw_core::structured_data::extract_json_ld so the
JSON-LD parser stays shared across the extraction pipeline.
Both are explicit-call only — /v1/scrape/shopify_product and
/v1/scrape/ecommerce_product. Not in auto-dispatch because any
arbitrary /products/{slug} URL could belong to either platform
(or to a custom site that uses the same path shape), and claiming
such URLs blindly would steal from the default markdown /v1/scrape
flow.
Live test results against real stores:
- Shopify / Allbirds Tree Runners: $100, 7 size variants, 4 images,
Size option, all SKUs. 250ms.
- ecommerce_product / same Allbirds URL: ProductGroup schema, name
'Men's Tree Runner', brand 'Allbirds', $100 USD InStock offer.
300ms. Different extraction path, same product.
- ecommerce_product / huel.com: 'Huel Black Edition' / 'Huel' brand,
200ms.
- Shopify stores behind Cloudflare (Gymshark, Tesla Shop) 403 as
expected \u2014 the error message points callers at the ecommerce_product
fallback, but Cloudflare also blocks the HTML path so those stores
are cloud-tier territory.
Catalog now exposes 19 extractors via GET /v1/extractors. Unit
tests: 59 passing across the module.
Scope not in v1:
- trustpilot_reviews: file written and tested (JSON-LD walker), but
NOT registered in the catalog or dispatch. Trustpilot's Cloudflare
turnstile blocks our Firefox + Chrome + Safari + mobile profiles
at the TLS layer. Shipping it would return 403 more often than 200.
Code kept in-tree under #[allow(dead_code)] for when the cloud
tier has residential-proxy support.
- Amazon / Walmart / Target / AliExpress: same Cloudflare / WAF
story. Not fixable without real browser + proxy pool.
- WooCommerce explicit: most WooCommerce stores ship Product JSON-LD,
so ecommerce_product covers them. A dedicated WooCommerce REST
extractor (/wp-json/wc/store/products) would be marginal on top of
that and only works on ~30% of stores that expose the REST API.
Wave 4 positioning: we now own the OSS structured-scrape space for
any site that respects Schema.org. That's Google's entire rich-result
index \u2014 meaningful territory competitors won't try to replicate as
named endpoints.
3 social-network extractors that work entirely without auth, using
public embed/preview endpoints + Instagram's own SEO-facing API:
- linkedin_post: /embed/feed/update/{urn} returns full body,
author, image, OG tags. Accepts both the urn:li:share
and urn:li:activity URN forms plus the pretty
/posts/{slug}-{id}-{suffix} URLs.
- instagram_post: /p/{shortcode}/embed/captioned/ returns the full
caption, username, thumbnail. Same endpoint serves
reels and IGTV, kind correctly classified.
- instagram_profile: /api/v1/users/web_profile_info/?username=X with the
x-ig-app-id header (Instagram's public web-app id,
sent by their own JS bundle). Returns the full
profile + the 12 most recent posts with shortcodes,
kinds, like/comment counts, thumbnails, and caption
previews. Falls back to OG-tag scraping of the
public HTML if the API ever 401/403s.
The IG profile output is shaped so callers can fan out cleanly:
for p in profile.recent_posts:
scrape('instagram_post', p.url)
giving you 'whole profile + every recent post' in one loop. End-to-end
tested against ticketswave: 1 profile call + 12 post calls in ~3.5s.
Pagination beyond 12 posts requires authenticated cookies and is left
for the cloud where we can stash a session.
Infrastructure change: added FetchClient::fetch_with_headers so
extractors can satisfy site-specific request headers (here x-ig-app-id;
later github_pr will use this for Authorization, etc.) without polluting
the global FetchConfig.headers map. Same retry semantics as fetch().
Catalog now exposes 17 extractors via /v1/extractors. Total unit tests
across the module: 47 passing. Clippy clean. Fmt clean.
Live test on the maintainer's example URLs:
- LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body
/ shipper.club link / CDN image extracted in 250ms.
- Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave
username, thumbnail. 200ms.
- Instagram profile (ticketswave): 18,473 followers (exact, not
rounded), is_verified=True, is_business=True, biography with emojis,
12 recent posts with shortcodes + kinds + likes. 400ms.
Out of scope for this wave (require infra we don't have):
- linkedin_profile: returns 999 to all bot UAs, needs OAuth
- facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome
- facebook_profile (personal): not publicly accessible by design
Adds 8 more vertical extractors using public JSON APIs. All hit
deterministic endpoints with no antibot risk. Live tests pass
against canonical URLs for each.
AI / ML ecosystem (3):
- crates_io \u2192 crates.io/api/v1/crates/{name}
- huggingface_dataset \u2192 huggingface.co/api/datasets/{path} (handles both
legacy /datasets/{name} and canonical {owner}/{name})
- arxiv \u2192 export.arxiv.org/api/query (Atom XML parsed by quick-xml)
Code / version control (2):
- github_pr \u2192 api.github.com/repos/{owner}/{repo}/pulls/{number}
- github_release \u2192 api.github.com/repos/{owner}/{repo}/releases/tags/{tag}
Infrastructure (1):
- docker_hub \u2192 hub.docker.com/v2/repositories/{namespace}/{name}
(official-image shorthand /_/nginx normalized to library/nginx)
Community / publishing (2):
- dev_to \u2192 dev.to/api/articles/{username}/{slug}
- stackoverflow \u2192 api.stackexchange.com/2.3/questions/{id} + answers,
filter=withbody for rendered HTML, sort=votes for
consistent top-answers ordering
Live test results (real URLs):
- serde: 942M downloads, 838B response
- 'Attention Is All You Need': abstract + authors, 1.8KB
- nginx official: 12.9B pulls, 21k stars, 17KB
- openai/gsm8k: 822k downloads, 1.7KB
- rust-lang/rust#138000: merged by RalfJung, +3/-2, 1KB
- webclaw v0.4.0: 2.4KB
- a real dev.to article: 2.2KB body, 3.1KB total
- python yield Q&A: score 13133, 51 answers, 104KB
Catalog now exposes 14 extractors via GET /v1/extractors. Total
unit tests across the module: 34 passing. Clippy clean. Fmt clean.
Marketing positioning sharpens: 14 dedicated extractors, all
deterministic, all 1-credit-per-call. Firecrawl's /extract is
5 credits per call and you write the schema yourself.
Reddit blocks wreq's Chrome 145 BoringSSL fingerprint at the JA3/JA4
TLS layer even though our HTTP headers correctly impersonate Chrome.
Curl from the same machine with the same Chrome User-Agent string
returns 200 from Reddit's .json endpoint; webclaw with the Chrome
profile returns 403. The detector clearly fingerprints below the
header layer.
Tested all six vertical extractors with the Firefox profile:
reddit, hackernews, github_repo, pypi, npm, huggingface_model all
return correct typed JSON. Firefox is a strict improvement on the
Chrome default for sites with active TLS-level bot detection, with
no regressions on the API-flavored sites that were already working.
Real fix is per-extractor preferred profile, but the structural
change to allow per-call profile selection in FetchClient is a
larger refactor. Flipping the global default is a one-line change
that ships the unblock now and lets users hit the new
/v1/scrape/{vertical} routes against Reddit immediately.
New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:
- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog
First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):
- reddit → www.reddit.com/*/.json
- hackernews → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo → api.github.com/repos/{owner}/{repo}
- pypi → pypi.org/pypi/{name}/json
- npm → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}
Server-side routes added:
- POST /v1/scrape/{vertical} explicit per-vertical extraction
- GET /v1/extractors catalog (name, label, description, url_patterns)
The dispatcher validates that URL matches the requested vertical
before running, so users get "URL doesn't match the X extractor"
instead of opaque parse failures inside the extractor.
17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github
863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).
Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic
extractors per site.
cargo install webclaw-mcp on a fresh machine prints
warning: field `tool_router` is never read
--> crates/webclaw-mcp/src/server.rs:22:5
The field is essential — dropping it unregisters every MCP tool. The
warning shows up because rmcp 1.3.x changed how the #[tool_handler]
macro reads the field: instead of referencing it by name in the
generated impl, it goes through a derived trait method. rustc's
dead-code lint sees only the named usage and fires.
The field stays. Annotated with #[allow(dead_code)] and a comment
explaining the situation so the next person looking at this doesn't
remove the field thinking it's actually unused.
No behaviour change. Verified clean compile under rmcp 1.3.0 in our
lock; the warning will disappear for anyone running cargo install
against this commit.