Three hard-site extractors that all require antibot bypass to ever
return usable data. They ship in OSS so the parsers + schema live
with the rest of the vertical extractors, but the fetch path routes
through cloud::smart_fetch_html \u2014 meaning:
- With WEBCLAW_CLOUD_API_KEY configured on webclaw-server (or
WEBCLAW_API_KEY in MCP / CLI), local fetch is tried first; on
challenge-page detection we escalate to api.webclaw.io/v1/scrape
with formats=['html'] and parse the antibot-bypassed HTML locally.
- Without a cloud key, callers get a typed CloudError::NotConfigured
whose Display message points at https://webclaw.io/signup.
Self-hosters without a webclaw.io account know exactly what to do.
## New extractors (all auto-dispatched \u2014 unique hosts)
- amazon_product: ASIN extraction from /dp/, /gp/product/,
/product/, /exec/obidos/ASIN/ URL shapes across every amazon.*
locale. Parses the Product JSON-LD Amazon ships for SEO; falls
back to #productTitle and #landingImage DOM selectors when
JSON-LD is absent. Returns price, currency, availability,
condition, brand, image, aggregate rating, SKU / MPN.
- ebay_listing: item-id extraction from /itm/{id} and
/itm/{slug}/{id} URLs across ebay.com / .co.uk / .de / .fr /
.it. Parses both bare Offer (Buy It Now) and AggregateOffer
(used-copies / auctions) from the Product JSON-LD. Returns
price or low/high-price range, currency, condition, seller,
offer_count, aggregate rating.
- trustpilot_reviews: reactivated from the `trustpilot_reviews`
file that was previously dead-code'd. Parser already worked; it
just needed the smart_fetch_html path to get past AWS WAF's
'Verifying Connection' interstitial. Organisation / LocalBusiness
JSON-LD block gives aggregate rating + up to 20 recent reviews.
## FetchClient change
- Added optional `cloud: Option<Arc<CloudClient>>` field with
`FetchClient::with_cloud(cloud) -> Self` builder + `cloud(&self)`
accessor. Extractors call client.cloud() to decide whether they
can escalate. Cheap clones (Arc-wrapped).
## webclaw-server wiring
AppState::new() now reads the cloud credential from env:
1. WEBCLAW_CLOUD_API_KEY \u2014 preferred, disambiguates from the
server's own inbound bearer token.
2. WEBCLAW_API_KEY \u2014 fallback only when the server is in open
mode (no inbound-auth key set), matching the MCP / CLI
convention of that env var.
When present, state.rs builds a CloudClient and attaches it to the
FetchClient via with_cloud(). Log line at startup so operators see
when cloud fallback is active.
## Catalog + dispatch
All three extractors registered in list() and in dispatch_by_url.
/v1/extractors catalog now exposes 22 verticals. Explicit
/v1/scrape/{vertical} routes work per the existing pattern.
## Tests
- 7 new unit tests (parse_asin multi-shape + parse from JSON-LD
fixture + DOM-fallback on missing JSON-LD for Amazon; ebay
URL-matching + slugged-URL parsing + both Offer and AggregateOffer
fixtures).
- Full extractors suite: 68 passing (was 59, +9 from the new files).
- fmt + clippy clean.
- No live-test story for these three inside CI \u2014 verifying them
means having WEBCLAW_CLOUD_API_KEY set against a real cloud
backend. Integration-test harness is a separate follow-up.
Catalog summary: 22 verticals total across wave 1-5. Hard-site
three are gated behind an actionable cloud-fallback upgrade path
rather than silently returning nothing or 403-ing the caller.
Reddit blocks wreq's Chrome 145 BoringSSL fingerprint at the JA3/JA4
TLS layer even though our HTTP headers correctly impersonate Chrome.
Curl from the same machine with the same Chrome User-Agent string
returns 200 from Reddit's .json endpoint; webclaw with the Chrome
profile returns 403. The detector clearly fingerprints below the
header layer.
Tested all six vertical extractors with the Firefox profile:
reddit, hackernews, github_repo, pypi, npm, huggingface_model all
return correct typed JSON. Firefox is a strict improvement on the
Chrome default for sites with active TLS-level bot detection, with
no regressions on the API-flavored sites that were already working.
Real fix is per-extractor preferred profile, but the structural
change to allow per-call profile selection in FetchClient is a
larger refactor. Flipping the global default is a one-line change
that ships the unblock now and lets users hit the new
/v1/scrape/{vertical} routes against Reddit immediately.
New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:
- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog
First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):
- reddit → www.reddit.com/*/.json
- hackernews → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo → api.github.com/repos/{owner}/{repo}
- pypi → pypi.org/pypi/{name}/json
- npm → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}
Server-side routes added:
- POST /v1/scrape/{vertical} explicit per-vertical extraction
- GET /v1/extractors catalog (name, label, description, url_patterns)
The dispatcher validates that URL matches the requested vertical
before running, so users get "URL doesn't match the X extractor"
instead of opaque parse failures inside the extractor.
17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github
863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).
Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic
extractors per site.
Self-hosters hitting docs/self-hosting were promised three binaries
but the OSS Docker image only shipped two. webclaw-server lived in
the closed-source hosted-platform repo, which couldn't be opened. This
adds a minimal axum REST API in the OSS repo so self-hosting actually
works without pretending to ship the cloud platform.
Crate at crates/webclaw-server/. Stateless, no database, no job queue,
single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map,
batch, extract, summarize, diff, brand}. JSON shapes mirror
api.webclaw.io for the endpoints OSS can support, so swapping between
self-hosted and hosted is a base-URL change.
Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison
is constant-time (subtle::ConstantTimeEq). Open mode (no key) is
allowed and binds 127.0.0.1 by default; the Docker image flips
WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box.
Hard caps to keep naive callers from OOMing the process: crawl capped
at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent.
For unbounded crawls or anti-bot bypass the docs point users at the
hosted API.
Dockerfile + Dockerfile.ci updated to copy webclaw-server into
/usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0
(new public binary).