Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions.\n\nIncludes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default.\n\nThanks to @devnen for the report and patch work.
Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.
Improve LLM-format output for modern news and documentation pages.
- Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records
- Fix element/text spacing without detaching punctuation on docs, forums, and reference pages
- Remove common accessibility link chrome from LLM text and link labels
- Bump workspace version to 0.6.0 and update the changelog
Thanks to Nenad Oric (@devnen) for the original PR and contribution.
The webclaw-core youtube module produces structured markdown but no
transcript; document that and point at the production server's
youtube_transcript.rs short-circuit for the full YoutubeData + caption
text shape.
The repo had no heading-level brand anchor, only a banner image and
an h3 slogan. Search engines indexing the README were missing the
canonical brand signal. The new h1 is what GitHub renders as the
title of the page and what Google co-ranks with webclaw.io.
Bumps workspace version to 0.5.7.
Surface webclaw.io as a clear alternative path for visitors who want
the antibot, JS rendering, async jobs, search, and watches the OSS
server doesn't ship. Sits between the value-prop and the install
instructions so self-host stays the primary on-ramp.
- New BrowserProfile::SafariIos mapped to BrowserVariant::SafariIos26.
Built on wreq_util::Emulation::SafariIos26 with 4 overrides (TLS
extension order, HTTP/2 HEADERS priority, real Safari iOS 26 headers,
gzip/deflate/br). Matches bogdanfinn safari_ios_26_0 JA3
8d909525bd5bbb79f133d11cc05159fe exactly. Empirically 9/10 on
immobiliare.it with country-it residential.
- BrowserProfile::Chrome aligned to bogdanfinn chrome_133: dropped
MAX_CONCURRENT_STREAMS from H2 SETTINGS, priority weight 256,
explicit extension_permutation, advertise h3 in ALPN and ALPS.
JA3 43067709b025da334de1279a120f8e14, akamai_fp
52d84b11737d980aef856699f885ca86. Fixes indeed.com and other
Cloudflare-fronted sites.
- New locale module: accept_language_for_url / accept_language_for_tld.
TLD to Accept-Language mapping, unknown TLDs default to en-US.
DataDome geo-vs-locale cross-checks are now trivially satisfiable.
- wreq-util bumped 2.2.6 to 3.0.0-rc.10 for Emulation::SafariIos26.
Reddit's .json API rejects the wreq-Chrome TLS fingerprint with a
403 even from residential IPs. Their block list includes known
browser-emulation library fingerprints. wreq-Firefox passes. The
CLI `vertical` subcommand already forced Firefox; MCP
`vertical_scrape` was still falling back to the long-lived
`self.fetch_client` which defaults to Chrome, so reddit failed
on MCP and nobody noticed because the earlier test runs all had
an API key set that masked the issue.
Switched vertical_scrape to reuse `self.firefox_or_build()` which
gives us the cached Firefox client (same pattern the scrape tool
uses when the caller requests `browser: firefox`). Firefox is
strictly-safer-than-Chrome for every vertical in the catalog, so
making it the hard default for `vertical_scrape` is the right call.
Verified end-to-end from a clean shell with no WEBCLAW_API_KEY:
- MCP reddit: 679ms, post/author/6 comments correct
- MCP instagram_profile: 1157ms, 18471 followers
No change to the `scrape` tool -- it keeps the user-selectable
browser param.
Bumps version to 0.5.3.
Wires the vertical extractor catalog into both the CLI and the MCP
server so users don't have to hit the HTTP API to invoke them. Same
semantics as `/v1/scrape/{vertical}` + `/v1/extractors`.
CLI (webclaw-cli):
- New subcommand `webclaw extractors` lists all 28 extractors with
name, label, and sample URL. `--json` flag emits the full catalog
as machine-readable JSON.
- New subcommand `webclaw vertical <name> <url>` runs a specific
extractor and prints typed JSON. Pretty-printed by default; `--raw`
for single-line. Exits 1 with a clear "URL does not match" error
on mismatch.
- FetchClient built with Firefox profile + cloud fallback attached
when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate.
MCP (webclaw-mcp):
- New tool `list_extractors` (no args) returns the catalog as
pretty-printed JSON for in-session discovery.
- New tool `vertical_scrape` takes `{name, url}` and returns typed
JSON. Reuses the long-lived self.fetch_client.
- Tool count goes from 10 to 12. Server-info instruction string
updated accordingly.
Tests: 215 passing, clippy clean. Manual surface-tested end-to-end:
CLI prints real Reddit/github/pypi data; MCP JSON-RPC session returns
28-entry catalog + typed responses for pypi/requests + rust-lang/rust
in 200-400ms.
Version bumped to 0.5.2 (minor for API additions, backwards compatible).
Adds `webclaw_fetch::Fetcher` trait. All 28 vertical extractors now
take `client: &dyn Fetcher` instead of `client: &FetchClient` directly.
Backwards-compatible: FetchClient implements Fetcher, blanket impls
cover `&T` and `Arc<T>`, so existing CLI / MCP / self-hosted-server
callers keep working unchanged.
Motivation: the production API server (api.webclaw.io) must not do
in-process TLS fingerprinting; it delegates all HTTP to the Go
tls-sidecar. Before this trait, exposing /v1/scrape/{vertical} on
production would have required importing wreq into the server's
dep graph, violating the CLAUDE.md rule. Now production can provide
its own TlsSidecarFetcher implementation and pass it to the same
dispatcher the OSS server uses.
Changes:
- New `crates/webclaw-fetch/src/fetcher.rs` defining the trait plus
blanket impls for `&T` and `Arc<T>`.
- `FetchClient` gains a tiny impl block in client.rs that forwards to
its existing public methods.
- All 28 extractor signatures migrated from `&FetchClient` to
`&dyn Fetcher` (sed-driven bulk rewrite, no semantic change).
- `cloud::smart_fetch` and `cloud::smart_fetch_html` take `&dyn Fetcher`.
- `extractors::dispatch_by_url` and `extractors::dispatch_by_name`
take `&dyn Fetcher`.
- `async-trait 0.1` added to webclaw-fetch deps (Rust 1.75+ has
native async-fn-in-trait but dyn dispatch still needs async_trait).
- Version bumped to 0.5.1, CHANGELOG updated.
Tests: 215 passing in webclaw-fetch (no new tests needed — the existing
extractor tests exercise the trait methods transparently).
Clippy: clean workspace-wide.
webclaw-fetch switched from primp to wreq 6.x (BoringSSL) a while ago
but CLAUDE.md still documented primp, the `[patch.crates-io]`
requirement, and RUSTFLAGS that no longer apply. Refreshed four
sections:
- Crate listing: webclaw-fetch uses wreq, not primp
- client.rs description: wreq BoringSSL, plus a note that FetchClient
will implement the new Fetcher trait so production can swap in a
tls-sidecar-backed fetcher without importing wreq
- Hard Rules: dropped obsolete `[patch.crates-io]` and RUSTFLAGS lines,
added the "Vertical extractors take `&dyn Fetcher`" rule that makes
the architectural separation explicit for the upcoming production
integration
- Removed language about primp being "patched"; reqwest in webclaw-llm
is now just "plain reqwest" with no relationship to wreq