Three hard-site extractors that all require antibot bypass to ever
return usable data. They ship in OSS so the parsers + schema live
with the rest of the vertical extractors, but the fetch path routes
through cloud::smart_fetch_html \u2014 meaning:
- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or
WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on
challenge-page detection we escalate to api.webclaw.io/v1/scrape
with formats=['html'] and parse the antibot-bypassed HTML locally.
- Without a cloud key, callers get a typed CloudError::NotConfigured
whose Display message points at https://webclaw.io/signup.
Self-hosters without a webclaw.io account know exactly what to do.
## New extractors (all auto-dispatched \u2014 unique hosts)
- amazon_product: ASIN extraction from /dp/, /gp/product/,
/product/, /exec/obidos/ASIN/ URL shapes across every amazon.*
locale. Parses the Product JSON-LD Amazon ships for SEO; falls
back to #productTitle and #landingImage DOM selectors when
JSON-LD is absent. Returns price, currency, availability,
condition, brand, image, aggregate rating, SKU / MPN.
- ebay_listing: item-id extraction from /itm/{id} and
/itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr /
.it. Parses both bare Offer (Buy It Now) and AggregateOffer
(used-copies / auctions) from the Product JSON-LD. Returns
price or low/high-price range, currency, condition, seller,
offer_count, aggregate rating.
- trustpilot_reviews: reactivated from the `trustpilot_reviews`
file that was previously dead-code'd. Parser already worked; it
just needed the smart_fetch_html path to get past AWS WAF's
'Verifying Connection' interstitial. Organisation / LocalBusiness
JSON-LD block gives aggregate rating + up to 20 recent reviews.
## FetchClient change
- Added optional `cloud: Option<Arc<CloudClient>>` field with
`FetchClient::with_cloud(cloud) -> Self` builder + `cloud(&self)`
accessor. Extractors call client.cloud() to decide whether they
can escalate. Cheap clones (Arc-wrapped).
## webclaw-server wiring
AppState::new() now reads the cloud credential from env:
1. WEBCLAW_CLOUD_API_KEY \u2014 preferred, disambiguates from the
server's own inbound bearer token.
2. WEBCLAW_API_KEY \u2014 fallback only when the server is in open
mode (no inbound-auth key set), matching the MCP / CLI
convention of that env var.
When present, state.rs builds a CloudClient and attaches it to the
FetchClient via with_cloud(). Log line at startup so operators see
when cloud fallback is active.
## Catalog + dispatch
All three extractors registered in list() and in dispatch_by_url.
/v1/extractors catalog now exposes 22 verticals. Explicit
/v1/scrape/{vertical} routes work per the existing pattern.
## Tests
- 7 new unit tests (parse_asin multi-shape + parse from JSON-LD
fixture + DOM-fallback on missing JSON-LD for Amazon; ebay
URL-matching + slugged-URL parsing + both Offer and AggregateOffer
fixtures).
- Full extractors suite: 68 passing (was 59, +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI \u2014 verifying them
means having WEBCLAW_CLOUD_API_KEY set against a real cloud
backend. Integration-test harness is a separate follow-up.
Catalog summary: 22 verticals total across wave 1-5. Hard-site
three are gated behind an actionable cloud-fallback upgrade path
rather than silently returning nothing or 403-ing the caller.
3 social-network extractors that work entirely without auth, using
public embed/preview endpoints + Instagram's own SEO-facing API:
- linkedin_post: /embed/feed/update/{urn} returns full body,
author, image, OG tags. Accepts both the urn:li:share
and urn:li:activity URN forms plus the pretty
/posts/{slug}-{id}-{suffix} URLs.
- instagram_post: /p/{shortcode}/embed/captioned/ returns the full
caption, username, thumbnail. Same endpoint serves
reels and IGTV, kind correctly classified.
- instagram_profile: /api/v1/users/web_profile_info/?username=X with the
x-ig-app-id header (Instagram's public web-app id,
sent by their own JS bundle). Returns the full
profile + the 12 most recent posts with shortcodes,
kinds, like/comment counts, thumbnails, and caption
previews. Falls back to OG-tag scraping of the
public HTML if the API ever 401/403s.
The IG profile output is shaped so callers can fan out cleanly:
for p in profile.recent_posts:
scrape('instagram_post', p.url)
giving you 'whole profile + every recent post' in one loop. End-to-end
tested against ticketswave: 1 profile call + 12 post calls in ~3.5s.
Pagination beyond 12 posts requires authenticated cookies and is left
for the cloud where we can stash a session.
Infrastructure change: added FetchClient::fetch_with_headers so
extractors can satisfy site-specific request headers (here x-ig-app-id;
later github_pr will use this for Authorization, etc.) without polluting
the global FetchConfig.headers map. Same retry semantics as fetch().
Catalog now exposes 17 extractors via /v1/extractors. Total unit tests
across the module: 47 passing. Clippy clean. Fmt clean.
Live test on the maintainer's example URLs:
- LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body
/ shipper.club link / CDN image extracted in 250ms.
- Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave
username, thumbnail. 200ms.
- Instagram profile (ticketswave): 18,473 followers (exact, not
rounded), is_verified=True, is_business=True, biography with emojis,
12 recent posts with shortcodes + kinds + likes. 400ms.
Out of scope for this wave (require infra we don't have):
- linkedin_profile: returns 999 to all bot UAs, needs OAuth
- facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome
- facebook_profile (personal): not publicly accessible by design
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)
Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
Content-Length up front (before the allocation) and the actual
.bytes() length after (belt-and-braces against lying upstreams).
Previously the HTML -> markdown conversion downstream could allocate
multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
response_json_capped helper with a 5 MB cap. Protects against a
malicious or runaway provider response exhausting memory.
Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.
Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.
Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
no warnings surfaced, the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
artifact patterns (previous rule would have swallowed package.json,
components.json, .smithery/*.json).
Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).
Version: 0.3.15 -> 0.3.16
CHANGELOG updated.
Refs: docs/AUDIT-2026-04-16.md (P2 section)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* chore: gitignore CLI research dumps, drop accidentally-tracked file
research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).
Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Three call sites in webclaw-fetch used .expect("semaphore closed") on
`Semaphore::acquire()`. Under normal operation they never fire, but
under a shutdown race or adversarial runtime state the spawned task
would panic and be silently dropped from the batch / crawl run — the
caller would see fewer results than URLs with no indication why.
Rewritten to match on the acquire result:
- client::fetch_batch and client::fetch_and_extract_batch_with_options
now emit BatchResult/BatchExtractResult carrying
FetchError::Build("semaphore closed before acquire").
- crawler's inner loop emits a failed PageResult with the same error
string instead of panicking.
Behaviorally a no-op for the happy path. Fixes the silent-dropped-task
class of bug noted in the 2026-04-16 audit.
Version: 0.3.14 -> 0.3.15
CHANGELOG updated.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Stress testing showed 33% of proxies are dead, causing 30s+ timeouts
per request with 3 retries (worst case 94s). Reducing timeout from 30s
to 12s and retries from 3 to 2 brings worst case to 25s. Combined with
disabling 509 dead proxies from the pool, this should significantly
improve response times under load.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.
This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.
84% pass rate across 1000 real sites. 384 unit tests green.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.
Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension
New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
--only-main-content, --include, and --exclude were ignored in batch
mode because run_batch used default ExtractionOptions. Added
fetch_and_extract_batch_with_options to pass CLI options through.
Closes#3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reddit blocks TLS-fingerprinted clients on their .json API but
accepts standard requests with a browser User-Agent. Switch to
a non-impersonated primp client for the Reddit fallback path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>