Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.
Improve LLM-format output for modern news and documentation pages.
- Filter noisy hydration payloads and low-value page-chrome entries out of structured data while preserving content-bearing Schema.org records
- Fix element/text spacing without detaching punctuation on docs, forums, and reference pages
- Remove common accessibility link chrome from LLM text and link labels
- Bump workspace version to 0.6.0 and update the changelog
Thanks to Nenad Oric (@devnen) for the original PR and contribution.
The webclaw-core youtube module produces structured markdown but no
transcript; document that and point at the production server's
youtube_transcript.rs short-circuit for the full YoutubeData + caption
text shape.
The repo had no heading-level brand anchor, only a banner image and
an h3 slogan. Search engines indexing the README were missing the
canonical brand signal. The new h1 is what GitHub renders as the
title of the page and what Google co-ranks with webclaw.io.
Bumps workspace version to 0.5.7.
Surface webclaw.io as a clear alternative path for visitors who want
the antibot, JS rendering, async jobs, search, and watches that the OSS
server doesn't ship. The section sits between the value-prop and the
install instructions so self-host stays the primary on-ramp.
- New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26.
Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS
extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers,
gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3
8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on
immobiliare.it with country-it residential.
- BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped
MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256,
explicit extension_permutation, advertise h3 in ALPN and ALPS.
JA3 43067709b025da334de1279a120f8e14, akamai_fp
52d84b11737d980aef856699f885ca86. Fixes indeed.com and other
Cloudflare-fronted sites.
- New locale module: accept_language_for_url / accept_language_for_tld.
TLD to Accept-Language mapping, unknown TLDs default to en-US.
DataDome geo-vs-locale cross-checks are now trivially satisfiable.
- wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.
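A rough sketch of the TLD to Accept-Language lookup described above. The function names come from this commit, but the signatures, the table entries, and the exact header values are illustrative assumptions, not the contents of the real locale module.

```rust
/// Hypothetical sketch: map a TLD to an Accept-Language header value.
/// The entries below are examples only; the real table is not shown in this log.
fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld.trim_start_matches('.').to_ascii_lowercase().as_str() {
        "it" => "it-IT,it;q=0.9,en;q=0.8",
        "de" => "de-DE,de;q=0.9,en;q=0.8",
        "fr" => "fr-FR,fr;q=0.9,en;q=0.8",
        // Unknown TLDs default to en-US, as the commit states.
        _ => "en-US,en;q=0.9",
    }
}

/// Hypothetical sketch: derive the TLD from a URL with plain string handling
/// (no URL crate, no IPv6 support) and delegate to the TLD lookup.
fn accept_language_for_url(url: &str) -> &'static str {
    // Drop the scheme if there is one.
    let after_scheme = url.split("://").nth(1).unwrap_or(url);
    // Keep only the authority part (everything before path, query, or fragment).
    let authority = after_scheme
        .split(|c| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Strip userinfo and port.
    let host = authority.rsplit('@').next().unwrap_or(authority);
    let host = host.split(':').next().unwrap_or(host);
    // The TLD is the last dot-separated label of the host.
    let tld = host.rsplit('.').next().unwrap_or("");
    accept_language_for_tld(tld)
}
```

With a mapping like this, a request to an `.it` host would carry an Italian-first Accept-Language header, which is the property the geo-vs-locale cross-check looks at.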
Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a
403 even from residential IPs. Their block list includes known
browser-emulation library fingerprints. wreq-Firefox passes. The
CLI `vertical` subcommand already forced Firefox; MCP
`vertical_scrape` was still falling back to the long-lived
`self.fetch_client` which defaults to Chrome, so reddit failed
on MCP and nobody noticed because the earlier test runs all had
an API key set that masked the issue.
Switched vertical_scrape to reuse `self.firefox_or_build()`, which
gives us the cached Firefox client (same pattern the scrape tool
uses when the caller requests `browser: firefox`). Firefox is
strictly safer than Chrome for every vertical in the catalog, so
making it the hard default for `vertical_scrape` is the right call.
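The cached-client pattern can be pictured with the sketch below. Only the names `fetch_client` and `firefox_or_build` come from this commit; the struct layout, the `OnceLock` cache, the `client_for_vertical` helper, and the stand-in `FetchClient` and `BrowserProfile` types are assumptions made for illustration.

```rust
use std::sync::{Arc, OnceLock};

/// Stand-in for the real profile enum; only two variants are modeled here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BrowserProfile {
    Chrome,
    Firefox,
}

/// Stand-in for the real HTTP client; it only records its profile.
struct FetchClient {
    profile: BrowserProfile,
}

/// Hypothetical server state: a long-lived default client plus a lazily built Firefox one.
struct McpServer {
    fetch_client: Arc<FetchClient>,
    firefox_client: OnceLock<Arc<FetchClient>>,
}

impl McpServer {
    fn new() -> Self {
        Self {
            fetch_client: Arc::new(FetchClient { profile: BrowserProfile::Chrome }),
            firefox_client: OnceLock::new(),
        }
    }

    /// Build the Firefox client on first use, then hand out the cached instance.
    fn firefox_or_build(&self) -> Arc<FetchClient> {
        Arc::clone(self.firefox_client.get_or_init(|| {
            Arc::new(FetchClient { profile: BrowserProfile::Firefox })
        }))
    }

    /// Vertical scrapes always take the Firefox client, never the Chrome default.
    fn client_for_vertical(&self) -> Arc<FetchClient> {
        self.firefox_or_build()
    }
}
```

Repeated calls return the same `Arc`, so forcing Firefox for verticals costs one extra client for the lifetime of the server rather than one per request.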
Verified end-to-end from a clean shell with no WEBCLAW_API_KEY:
- MCP reddit: 679ms, post/author/6 comments correct
- MCP instagram_profile: 1157ms, 18471 followers
No change to the `scrape` tool -- it keeps the user-selectable
browser param.
Bumps version to 0.5.3.
Wires the vertical extractor catalog into both the CLI and the MCP
server so users don't have to hit the HTTP API to invoke them. Same
semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.
CLI (webclaw-cli):
- New subcommand `webclaw extractors` lists all 28 extractors with
name, label, and sample URL. `--json` flag emits the full catalog
as machine-readable JSON.
- New subcommand `webclaw vertical <name> <url>` runs a specific
extractor and prints typed JSON. Pretty-printed by default; `--raw`
for single-line. Exits 1 with a clear "URL does not match" error
on mismatch.
- FetchClient built with Firefox profile + cloud fallback attached
when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate.
MCP (webclaw-mcp):
- New tool `list_extractors` (no args) returns the catalog as
pretty-printed JSON for in-session discovery.
- New tool `vertical_scrape` takes `{name, url}` and returns typed
JSON. Reuses the long-lived self.fetch_client.
- Tool count goes from 10 to 12. Server-info instruction string
updated accordingly.
Tests: 215 passing, clippy clean. Manual surface-tested end-to-end:
CLI prints real Reddit/GitHub/PyPI data; MCP JSON-RPC session returns
28-entry catalog + typed responses for pypi/requests + rust-lang/rust
in 200-400ms.
Version bumped to 0.5.2 (patch-level bump for backwards-compatible API additions).
Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now
take `client: &dyn Fetcher` instead of `client: &FetchClient` directly.
Backwards-compatible: FetchClient implements Fetcher, blanket impls
cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server
callers keep working unchanged.
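The shape of this change can be sketched as follows. This is a simplified, synchronous stand-in: the real trait is async (it relies on async_trait for dyn dispatch) and its method set is not shown in this log, so the `fetch` method, `StaticFetcher`, and `extract_title` are hypothetical names used only to show the blanket-impl pattern.

```rust
use std::sync::Arc;

/// Simplified, synchronous stand-in for the Fetcher trait.
trait Fetcher {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

// Blanket impls so `&T` and `Arc<T>` are themselves Fetchers.
impl<T: Fetcher + ?Sized> Fetcher for &T {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

impl<T: Fetcher + ?Sized> Fetcher for Arc<T> {
    fn fetch(&self, url: &str) -> Result<String, String> {
        (**self).fetch(url)
    }
}

/// Hypothetical fetcher returning a canned body; stands in for FetchClient
/// or for a sidecar-backed implementation.
struct StaticFetcher {
    body: String,
}

impl Fetcher for StaticFetcher {
    fn fetch(&self, _url: &str) -> Result<String, String> {
        Ok(self.body.clone())
    }
}

/// Hypothetical extractor: it depends on the trait object, not on a concrete client.
fn extract_title(client: &dyn Fetcher, url: &str) -> Result<String, String> {
    let html = client.fetch(url)?;
    let start = html.find("<title>").ok_or("no <title>")? + "<title>".len();
    let end = html[start..].find("</title>").ok_or("unclosed <title>")? + start;
    Ok(html[start..end].trim().to_string())
}
```

Because the extractor only sees `&dyn Fetcher`, a caller can pass a plain reference or an `Arc`-wrapped client without the extractor's signature changing.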
Motivation: the production API server (api.webclaw.io) must not do
in-process TLS fingerprinting; it delegates all HTTP to the Go
tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on
production would have required importing wreq into the server's
dep graph, violating the CLAUDE.md rule. Now production can provide
its own TlsSidecarFetcher implementation and pass it to the same
dispatcher the OSS server uses.
Changes:
- New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus
blanket impls for `&T` and `Arc<T>`.
- `FetchClient` gains a tiny impl block in client.rs that forwards to
its existing public methods.
- All 28 extractor signatures migrated from `&FetchClient` to
`&dyn Fetcher` (sed-driven bulk rewrite, no semantic change).
- `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`.
- `extractors::dispatch_by_url` and `extractors::dispatch_by_name`
take `&dyn Fetcher`.
- `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has
native async-fn-in-trait but dyn dispatch still needs async_trait).
- Version bumped to 0.5.1, CHANGELOG updated.
Tests: 215 passing in webclaw-fetch (no new tests needed — the existing
extractor tests exercise the trait methods transparently).
Clippy: clean workspace-wide.
webclaw-fetch switched from primp to wreq 6.x (BoringSSL) a while ago
but CLAUDE.md still documented primp, the `[patch.crates-io]`
requirement, and RUSTFLAGS that no longer apply. Refreshed four
sections:
- Crate listing: webclaw-fetch uses wreq, not primp
- client.rs description: wreq BoringSSL, plus a note that FetchClient
will implement the new Fetcher trait so production can swap in a
tls-sidecar-backed fetcher without importing wreq
- Hard Rules: dropped obsolete `[patch.crates-io]` and RUSTFLAGS lines,
added the "Vertical extractors take `&dyn Fetcher`" rule that makes
the architectural separation explicit for the upcoming production
integration
- Removed language about primp being "patched"; reqwest in webclaw-llm
is now just "plain reqwest" with no relationship to wreq
See CHANGELOG.md for the full entry. Headline: 28 site-specific
extractors returning typed JSON, five with automatic antibot
cloud-escalation via api.webclaw.io, `POST /v1/scrape/{vertical}` +
`GET /v1/extractors` on webclaw-server.
Addresses the four follow-ups surfaced by the cloud-key smoke test.
trustpilot_reviews — full rewrite for 2025 schema:
- Trustpilot moved from single-Organization+aggregateRating to three
separate JSON-LD blocks: a site-level Organization (Trustpilot
itself), a Dataset with a csvw:Table mainEntity carrying the
per-star distribution for the target business, and an aiSummary +
aiSummaryReviews block with the AI-generated summary and recent
review objects.
- Parser now: skips the site-level Org, walks @graph as either array
or single object, picks the Dataset whose about.@id references the
target domain, parses each csvw:column for rating buckets, computes
weighted-average rating + total from the distribution, extracts the
aiSummary text, and turns aiSummaryReviews into a clean reviews
array with author/country/date/rating/title/text/likes.
- OG-title regex fallbacks for business_name, rating_label, and
average_rating when the Dataset block is absent. OG-description
regex for review_count.
- Returned shape: url, domain, business_name, rating_label,
average_rating, review_count, rating_distribution (per-star count
and percent), ai_summary, recent_reviews, review_count_listed,
data_source.
- Verified live: anthropic.com returns "Anthropic" / "Bad" / 1.4 /
226 reviews with full distribution + AI summary + 2 recent reviews.
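The weighted-average step above amounts to the arithmetic below. This is a minimal sketch: the `summarize_distribution` name, the fixed five-bucket array, and the rounding to one decimal place are assumptions, not the parser's actual code.

```rust
/// Hypothetical helper: `counts[i]` is the number of reviews with `i + 1` stars.
/// Returns the weighted-average rating (rounded to one decimal) and the total count.
fn summarize_distribution(counts: [u64; 5]) -> (Option<f64>, u64) {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        // No reviews: there is no meaningful average.
        return (None, 0);
    }
    // Sum of stars * count over all buckets.
    let weighted: u64 = counts
        .iter()
        .enumerate()
        .map(|(i, &n)| (i as u64 + 1) * n)
        .sum();
    // Assumed presentation choice: round to one decimal place.
    let average = (weighted as f64 / total as f64 * 10.0).round() / 10.0;
    (Some(average), total)
}
```

For example, three 1-star reviews and one 2-star review give a total of 4 and a raw average of 5 / 4 = 1.25, which this sketch rounds to 1.3.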
amazon_product — force-cloud-escalation + OG fallback:
- Amazon serves Product JSON-LD intermittently even on non-CAPTCHA
pages. When local fetch returns HTML without Product JSON-LD and
a cloud client is configured, force-escalate to the cloud path
which reliably surfaces title + description via its render engine.
- New OG meta-tag fallback for title/image/description so the
cloud's synthesize_html output (OG tags only, no #productTitle DOM
ID) still yields useful data. Real Amazon pages still prefer the
DOM regex.
- Verified live: B0BSHF7WHW escalates to cloud, returns Apple
MacBook Pro title + description + asin.
etsy_listing — slug humanization + generic-page filtering + shop
from brand:
- Etsy serves various placeholder pages when a listing is delisted,
blocked, or unavailable: "etsy.com", "Etsy - Your place to buy...",
"This item is unavailable - Etsy", plus the OG description
"Sorry, the page you were looking for was not found." The is_generic_*
helpers catch all of these shapes.
- When the OG title is generic, humanize the URL slug: the path
`/listing/123456789/personalized-stainless-steel-tumbler` becomes
`Personalized Stainless Steel Tumbler` so callers always get a
meaningful title even on dead listings.
- Etsy uses `brand` (top-level JSON-LD field) for the shop name on
listings that don't ship offers[].seller.name. Shop now falls
through offers -> brand so either schema resolves.
- Verified live: listing/1097462299 returns full rich data
(title, price 51.43 EUR, shop BlankEarthCeramics, 4.9 rating /
225 reviews, InStock).
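The slug humanization described above could look roughly like this. The `humanize_listing_slug` name and the exact validation rules are assumptions; only the input path and output title in the test come from the commit text.

```rust
/// Hypothetical helper: turn `/listing/<id>/<slug>` into a title-cased string.
/// Returns None when the path is not a listing path or has no slug.
fn humanize_listing_slug(path: &str) -> Option<String> {
    let mut parts = path.trim_matches('/').split('/');
    if parts.next()? != "listing" {
        return None;
    }
    // The listing ID must be a non-empty run of ASCII digits.
    let id = parts.next()?;
    if id.is_empty() || !id.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let slug = parts.next()?;
    // Capitalize the first character of each hyphen-separated word.
    let words: Vec<String> = slug
        .split('-')
        .filter(|w| !w.is_empty())
        .map(|w| {
            let mut chars = w.chars();
            match chars.next() {
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect();
    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}
```

Returning `Option` keeps the fallback explicit: a caller would only reach for the slug when the OG title has already been judged generic, and can still fall through to nothing when the URL carries no slug.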
cloud.rs — module doc update:
- Added an architecture section documenting that api.webclaw.io does
not return raw HTML by design and that [`synthesize_html`]
reassembles the parsed response (metadata + structured_data +
markdown) back into minimal HTML so existing local parsers run
unchanged across both paths. Also notes the DOM-regex limitation
for extractors that need live-page-specific DOM IDs.
Tests: 215 passing in webclaw-fetch (18 new), clippy clean.
Smoke test against all 28 extractors with WEBCLAW_CLOUD_API_KEY:
28/28 clean, 0 partial, 0 failed.
api.webclaw.io/v1/scrape does not return a `html` field even when
`formats=["html"]` is requested, by design: the cloud API returns
pre-parsed `structured_data` (JSON-LD blocks), `metadata` (OG tags,
title, description, image, site_name), and `markdown`.
Our CloudClient::fetch_html helper was premised on the API returning
raw HTML. Without a key set, the error message was hidden behind
CloudError::NotConfigured so the bug never surfaced. With a key set,
every extractor that escalated to cloud (trustpilot_reviews,
etsy_listing, amazon_product, ebay_listing, substack_post HTML
fallback) got back "cloud /v1/scrape returned no html field".
Fix: reassemble a minimal synthetic HTML document from the cloud's
parsed output. Each JSON-LD block goes back into a
`<script type="application/ld+json">` tag, metadata fields become OG
`<meta>` tags, and the markdown body lands in a `<pre>` tag. Existing
local extractor parsers (find_product_jsonld, find_business,
og() regex) see the same shapes they'd see from a real page, so no
per-extractor changes needed.
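The reassembly described above can be sketched as follows. The `CloudScrape` struct, its field types, and the escaping details are simplified assumptions; only the overall layout (JSON-LD in script tags, metadata as OG meta tags, markdown in a `<pre>`) comes from this commit.

```rust
/// Hypothetical, trimmed-down view of the cloud response fields named above.
struct CloudScrape {
    /// Each entry is one JSON-LD block, already serialized to a string.
    structured_data: Vec<String>,
    /// Metadata pairs such as ("title", "...") or ("image", "...").
    metadata: Vec<(String, String)>,
    markdown: String,
}

/// Minimal HTML escaping for attribute values and text nodes.
fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Rebuild a minimal HTML document that local parsers can consume.
fn synthesize_html(resp: &CloudScrape) -> String {
    let mut html = String::from("<html><head>\n");
    for (key, value) in &resp.metadata {
        html.push_str(&format!(
            "<meta property=\"og:{}\" content=\"{}\">\n",
            escape_html(key),
            escape_html(value)
        ));
    }
    for block in &resp.structured_data {
        // Keep a "</" inside a JSON string from closing the script element early.
        let safe = block.replace("</", "<\\/");
        html.push_str(&format!(
            "<script type=\"application/ld+json\">{}</script>\n",
            safe
        ));
    }
    html.push_str("</head><body>\n<pre>");
    html.push_str(&escape_html(&resp.markdown));
    html.push_str("</pre>\n</body></html>\n");
    html
}
```

The output is deliberately small: it carries only the shapes the existing JSON-LD and OG parsers look for, which is also why extractors that depend on live-page DOM IDs get nothing from it.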
Verified end-to-end with WEBCLAW_CLOUD_API_KEY set:
- trustpilot_reviews: escalates, returns Organization JSON-LD data
(parser picks Trustpilot site-level Org not the reviewed business;
tracked as a follow-up to update Trustpilot schema handling)
- etsy_listing: escalates via antibot render path; listing-specific
data depends on target listing having JSON-LD (many Etsy listings
don't)
- amazon_product, ebay_listing: stay local because their pages ship
enough content not to trigger bot-detection escalation
- The other 24 extractors unchanged (local path, zero cloud credits)
Tests: 200 passing in webclaw-fetch (3 new), clippy clean.