Mirror of https://github.com/0xMassi/webclaw.git, synced 2026-04-25 00:06:21 +02:00.

Compare commits (33 commits)
a5c3433372 · 966981bc42 · 866fa88aa0 · b413d702b2 · 98a177dec4 · e1af2da509 · 2285c585b1 · b77767814a · 4bf11d902f · 0daa2fec1a · 058493bc8f · aaa5103504 · 2373162c81 · b2e7dbf365 · e10066f527 · a53578e45c · 7f5eb93b65 · 8cc727c2f2 · d8c9274a9c · 0ab891bd6b · 0221c151dc · 3bb0a4bca0 · b041f3cddd · 86182ef28a · 8ba7538c37 · ccdb6d364b · eff914e84f · c7e5abea8f · d71eebdacc · d91ad9c1f4 · 2ba682adf3 · b4bfff120e · e27ee1f86f
77 changed files with 13193 additions and 566 deletions
.github/workflows/release.yml (vendored): 12 lines changed
```diff
@@ -66,8 +66,14 @@ jobs:
           tag="${GITHUB_REF#refs/tags/}"
           staging="webclaw-${tag}-${{ matrix.target }}"
           mkdir "$staging"
-          cp target/${{ matrix.target }}/release/webclaw "$staging/" 2>/dev/null || true
-          cp target/${{ matrix.target }}/release/webclaw-mcp "$staging/" 2>/dev/null || true
+          # Fail loud if any binary is missing. A silent `|| true` on the
+          # copy was how v0.4.0 shipped tarballs that lacked webclaw-server —
+          # don't repeat that mistake. If a future binary gets renamed or
+          # removed, this step should scream, not quietly publish an
+          # incomplete release.
+          cp target/${{ matrix.target }}/release/webclaw "$staging/"
+          cp target/${{ matrix.target }}/release/webclaw-mcp "$staging/"
+          cp target/${{ matrix.target }}/release/webclaw-server "$staging/"
           cp README.md LICENSE "$staging/"
           tar czf "$staging.tar.gz" "$staging"
           echo "ASSET=$staging.tar.gz" >> $GITHUB_ENV
@@ -134,6 +140,7 @@ jobs:
             mkdir -p "binaries-${target}"
             cp "${dir}/webclaw" "binaries-${target}/webclaw"
             cp "${dir}/webclaw-mcp" "binaries-${target}/webclaw-mcp"
+            cp "${dir}/webclaw-server" "binaries-${target}/webclaw-server"
             chmod +x "binaries-${target}"/*
           done
           ls -laR binaries-*/
@@ -220,6 +227,7 @@ jobs:
       def install
         bin.install "webclaw"
         bin.install "webclaw-mcp"
+        bin.install "webclaw-server"
       end

       test do
```
CHANGELOG.md: 110 lines changed
`@@ -3,6 +3,116 @@`

All notable changes to webclaw are documented here.
Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.5.6] — 2026-04-23

### Added

- `FetchClient::fetch_smart(url)` applies per-site rescue logic and returns the same `FetchResult` shape as `fetch()`. Reddit URLs route to the `.json` API with an identifiable bot `User-Agent`, and Akamai-style challenge pages trigger a homepage cookie warmup plus a retry. Makes `/v1/scrape` on Reddit populate markdown again.

### Fixed

- Regression introduced in 0.5.4 where the production server's `/v1/scrape` bypassed the Reddit `.json` shortcut and Akamai cookie warmup that `fetch_and_extract` had been providing. Both helpers now live in `fetch_smart`, and every caller path picks them up.
- Panic in the markdown converter (`markdown.rs:925`) on single-pipe `|` lines: a `[1..len-1]` slice on a 1-character input violated the `begin <= end` slice invariant and panicked. Now guarded by a length check.
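  The class of bug can be sketched as follows; `strip_outer_pipes` is a hypothetical stand-in for the converter's table-row trimming, not webclaw's actual code.

  ```rust
  // Trimming the outer pipes of a table row like "|a|b|" with `[1..len-1]`
  // panics on the single character "|": the range 1..0 violates begin <= end.
  // Checking the length first makes the degenerate input a no-op.
  fn strip_outer_pipes(line: &str) -> &str {
      let bytes = line.as_bytes();
      // Need at least two bytes before slicing one off each end. The pipe
      // checks keep the byte-index slice on ASCII (so on char boundaries).
      if bytes.len() >= 2 && bytes[0] == b'|' && bytes[bytes.len() - 1] == b'|' {
          &line[1..line.len() - 1]
      } else {
          line
      }
  }

  fn main() {
      assert_eq!(strip_outer_pipes("|a|b|"), "a|b");
      assert_eq!(strip_outer_pipes("|"), "|"); // the crash case, now a no-op
  }
  ```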

---

## [0.5.5] — 2026-04-23

### Added

- `webclaw --browser safari-ios` on the CLI. Pairs with `--proxy` for DataDome-fronted sites that reject desktop profiles.

---

## [0.5.4] — 2026-04-23

### Added

- New `BrowserProfile::SafariIos` for Safari iOS 26 fingerprinting. Pairs with a country-matched residential proxy for sites that reject non-mobile profiles.
- `accept_language_for_url(url)` and `accept_language_for_tld(tld)` helpers. Both return a locale-appropriate `Accept-Language` value based on the URL's TLD, with `en-US` as the fallback.
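  A minimal sketch of the TLD-based lookup; the mapping entries and exact header values here are illustrative assumptions, and only the `en-US` fallback is stated by this changelog entry.

  ```rust
  // Hypothetical TLD → Accept-Language table; the real helper's entries
  // and quality values may differ.
  fn accept_language_for_tld(tld: &str) -> &'static str {
      match tld {
          "fr" => "fr-FR,fr;q=0.9,en;q=0.8",
          "de" => "de-DE,de;q=0.9,en;q=0.8",
          "it" => "it-IT,it;q=0.9,en;q=0.8",
          "jp" => "ja-JP,ja;q=0.9,en;q=0.8",
          // Fallback for .com, .org, and unknown TLDs.
          _ => "en-US,en;q=0.9",
      }
  }

  fn main() {
      assert_eq!(accept_language_for_tld("de"), "de-DE,de;q=0.9,en;q=0.8");
      assert_eq!(accept_language_for_tld("com"), "en-US,en;q=0.9");
  }
  ```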

### Changed

- Chrome browser fingerprint refreshed for current Cloudflare bot management. Fixes 403 challenges on several e-commerce and jobs sites.
- Bumped `wreq-util` to `3.0.0-rc.10`.

---

## [0.5.2] — 2026-04-22

### Added

- **`webclaw vertical <name> <url>` subcommand on the CLI.** Runs a specific vertical extractor and prints typed JSON (pretty-printed by default, `--raw` for single-line). Example: `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/` returns `{post: {title, author, points, ...}, comments: [...]}`. URL-mismatch errors surface cleanly as `"URL '...' does not match the '...' extractor"` on stderr with exit code 1.
- **`webclaw extractors` subcommand on the CLI.** Lists all 28 vertical extractors with name, label, and one URL pattern sample. `--json` emits the full catalog as JSON (same shape as `GET /v1/extractors`) for tooling. Covers discovery for users who don't know which vertical to pick.
- **`vertical_scrape` and `list_extractors` tools on `webclaw-mcp`.** Claude Desktop / Claude Code users can now call any of the 28 extractors by name from an MCP session. Tool count goes from 10 to 12. `list_extractors` takes no args and returns the full catalog; `vertical_scrape` takes `{name, url}` and returns the typed JSON payload. Antibot-gated verticals still auto-escalate to the webclaw cloud API when `WEBCLAW_API_KEY` is set.

### Changed

- Server-info instruction string in `webclaw-mcp` now lists all 12 tools (previously hard-coded to 10). `webclaw --help` on the CLI now shows the three subcommands: `bench`, `extractors`, `vertical`.

---

## [0.5.1] — 2026-04-22

### Added

- **`webclaw_fetch::Fetcher` trait.** Vertical extractors now consume `&dyn Fetcher` instead of `&FetchClient` directly. The trait exposes three methods (`fetch`, `fetch_with_headers`, `cloud`) covering everything extractors need. Callers that already held a `FetchClient` keep working unchanged: `FetchClient` implements `Fetcher`, blanket impls cover `&T` and `Arc<T>`, so `&client` coerces to `&dyn Fetcher` automatically.

  The motivation is the split between OSS (wreq-backed, in-process TLS fingerprinting) and the production API server at api.webclaw.io (which cannot use in-process fingerprinting per the architecture rule, and must delegate HTTP through the Go tls-sidecar). Before this trait, adding vertical routes to the production server would have required importing wreq into its dependency graph, violating the separation. Now the production server can provide its own `TlsSidecarFetcher` implementation and pass it to the same extractor dispatcher the OSS server uses.

  Backwards compatible: no behavior change for CLI, MCP, or OSS self-host.
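  The coercion story can be sketched in miniature (simplified, infallible signatures for illustration; the real trait also has `fetch_with_headers` and `cloud`):

  ```rust
  // A concrete client implements the trait, and a blanket impl forwards `&T`,
  // so call sites holding a `FetchClient` keep compiling unchanged.
  trait Fetcher {
      fn fetch(&self, url: &str) -> String;
  }

  struct FetchClient;

  impl Fetcher for FetchClient {
      fn fetch(&self, url: &str) -> String {
          format!("fetched {url}")
      }
  }

  // Blanket impl: a reference to any Fetcher is itself a Fetcher.
  // (The crate is described as adding a similar forwarding impl for Arc<T>.)
  impl<T: Fetcher + ?Sized> Fetcher for &T {
      fn fetch(&self, url: &str) -> String {
          (**self).fetch(url)
      }
  }

  // Extractors take the trait object, not the concrete client.
  fn extract(client: &dyn Fetcher, url: &str) -> String {
      client.fetch(url)
  }

  fn main() {
      let client = FetchClient;
      // `&FetchClient` coerces to `&dyn Fetcher` automatically.
      assert_eq!(extract(&client, "https://example.com"), "fetched https://example.com");
  }
  ```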

### Changed

- All 28 extractor `extract()` signatures migrated from `client: &FetchClient` to `client: &dyn Fetcher`. The dispatcher functions (`extractors::dispatch_by_url`, `extractors::dispatch_by_name`) and the cloud escalation helpers (`cloud::smart_fetch`, `cloud::smart_fetch_html`) follow the same change. Tests and call sites are unchanged because `&FetchClient` auto-coerces.

---

## [0.5.0] — 2026-04-22

### Added

- **28 vertical extractors that return typed JSON instead of generic markdown.** New `webclaw_fetch::extractors` module with one extractor per site. Dev: reddit, hackernews, github_repo / github_pr / github_issue / github_release, crates_io, pypi, npm. AI/ML: huggingface_model, huggingface_dataset, arxiv, docker_hub. Writing: dev_to, stackoverflow, youtube_video. Social: linkedin_post, instagram_post, instagram_profile. Ecommerce: shopify_product, shopify_collection, ecommerce_product (generic Schema.org), woocommerce_product, amazon_product, ebay_listing, etsy_listing. Reviews: trustpilot_reviews, substack_post. Each extractor claims a URL pattern via a public `matches()` fn and returns a typed JSON payload with the fields callers actually want (title, price, author, rating, review count, etc.) rather than a markdown blob.
- **`POST /v1/scrape/{vertical}` on `webclaw-server` for explicit vertical routing.** Picks the parser by name, validates the URL plausibly belongs to that vertical, returns the same shape as `POST /v1/scrape` but typed. 23 of 28 verticals also auto-dispatch from a plain `POST /v1/scrape` because their URL shapes are unique enough to claim safely; the remaining 5 (`shopify_product`, `shopify_collection`, `ecommerce_product`, `woocommerce_product`, `substack_post`) use patterns that non-target sites share, so callers opt in via the `{vertical}` route.
- **`GET /v1/extractors` on `webclaw-server`.** Returns the full catalog as `{"extractors": [{"name": "...", "label": "...", "description": "...", "url_patterns": [...]}, ...]}` so clients can build tooling / autocomplete / user-facing docs off a live source.
- **Antibot cloud-escalation for 5 ecommerce + reviews verticals.** Amazon, eBay, Etsy, Trustpilot, and Substack (as HTML fallback) go through `cloud::smart_fetch_html`: try local fetch first; on bot-protection detection (Cloudflare challenge, DataDome, AWS WAF "Verifying your connection", etc.) escalate to `api.webclaw.io/v1/scrape`. Without `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` the extractor returns a typed `CloudError::NotConfigured` with an actionable signup link. With a key set, escalation is automatic. Every extractor stamps a `data_source: "local" | "cloud"` field on the response so callers can tell which path ran.
- **`cloud::synthesize_html` for cloud-bypassed extraction.** `api.webclaw.io/v1/scrape` deliberately does not return raw HTML; it returns a parsed bundle (`structured_data` JSON-LD blocks + `metadata` OG/meta tags + `markdown`). The new helper reassembles that bundle back into a minimal synthetic HTML doc (JSON-LD as `<script>` tags, metadata as OG `<meta>` tags, markdown in a `<pre>`) so existing local parsers run unchanged across both paths. No per-extractor code path branches are needed for "came from cloud" vs "came from local".
- **Trustpilot 2025 schema parser.** Trustpilot replaced their single-Organization + aggregateRating shape with three separate JSON-LD blocks: a site-level Organization (Trustpilot itself), a Dataset with a csvw:Table `mainEntity` carrying the per-star distribution for the target business, and an aiSummary + aiSummaryReviews block with the AI-generated summary and recent reviews. The parser walks all three, skips the site-level Org, picks the Dataset by `about.@id` matching the target domain, parses each csvw:column for rating buckets, computes weighted-average rating + total from the distribution, extracts the aiSummary text, and returns recent reviews with author / country / date / rating / title / text / likes.
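  The weighted-average step can be sketched like so (function name and the `(stars, count)` shape are illustrative, not the parser's actual types):

  ```rust
  // Given the per-star distribution parsed from the csvw columns, derive the
  // overall rating and total review count.
  fn rating_from_distribution(dist: &[(u32, u64)]) -> (f64, u64) {
      let total: u64 = dist.iter().map(|&(_, n)| n).sum();
      if total == 0 {
          return (0.0, 0);
      }
      let weighted: u64 = dist.iter().map(|&(stars, n)| u64::from(stars) * n).sum();
      (weighted as f64 / total as f64, total)
  }

  fn main() {
      // 8 five-star + 2 one-star reviews: (8*5 + 2*1) / 10 = 4.2
      let (rating, total) = rating_from_distribution(&[(5, 8), (1, 2)]);
      assert!((rating - 4.2).abs() < 1e-9);
      assert_eq!(total, 10);
  }
  ```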

- **OG-tag fallback in `ecommerce_product` for sites with no JSON-LD and sites with JSON-LD but empty offers.** Three paths now: `jsonld` (Schema.org Product with offers), `jsonld+og` (Product JSON-LD plus OG product tags filling in missing price), and `og_fallback` (no JSON-LD at all, build minimal payload from `og:title`, `og:image`, `og:description`, `product:price:amount`, `product:price:currency`, `product:availability`, `product:brand`). `has_og_product_signal()` gates the fallback on `og:type=product` or a price tag so blog posts don't get mis-classified as products.
- **URL-slug title fallback in `etsy_listing` for delisted / blocked pages.** When Etsy serves a placeholder page (`"etsy.com"`, `"Etsy - Your place to buy..."`, `"This item is unavailable"`), humanise the URL slug (`/listing/123/personalized-stainless-steel-tumbler` becomes `"Personalized Stainless Steel Tumbler"`) so callers always get a meaningful title. The shop name also falls through `offers[].seller.name` and then top-level `brand`, because Etsy uses both schemas depending on listing age.
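  The humanising step is roughly this (a sketch; the extractor's exact casing rules may differ):

  ```rust
  // Take the last path segment, split on hyphens, and capitalize each word.
  fn title_from_slug(path: &str) -> String {
      let slug = path.rsplit('/').next().unwrap_or("");
      slug.split('-')
          .filter(|w| !w.is_empty())
          .map(|w| {
              let mut chars = w.chars();
              match chars.next() {
                  Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                  None => String::new(),
              }
          })
          .collect::<Vec<_>>()
          .join(" ")
  }

  fn main() {
      assert_eq!(
          title_from_slug("/listing/123/personalized-stainless-steel-tumbler"),
          "Personalized Stainless Steel Tumbler"
      );
  }
  ```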

- **Force-cloud-escalation in `amazon_product` when local HTML lacks Product JSON-LD.** Amazon A/B-tests JSON-LD presence. When local fetch succeeds but has no `Product` block and a cloud client is configured, the extractor force-escalates to the cloud which reliably surfaces title + description via its render engine. Added OG meta-tag fallback so the cloud's synthesized HTML output (OG tags only, no Amazon DOM IDs) still yields title / image / description.
- **AWS WAF "Verifying your connection" detector in `cloud::is_bot_protected`.** Trustpilot serves a `~565` byte interstitial with an `interstitial-spinner` CSS class. The detector now fires on that pattern with a `< 10_000` byte size gate to avoid false positives on real articles that happen to mention the phrase.
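  The size-gated check amounts to something like this (illustrative; the real `is_bot_protected` also covers Cloudflare, DataDome, and other vendors):

  ```rust
  // Fire only on small responses: a ~565-byte interstitial matches, while a
  // 20 KB article that merely mentions the phrase does not.
  fn looks_like_aws_waf_interstitial(body: &str) -> bool {
      body.len() < 10_000
          && (body.contains("interstitial-spinner")
              || body.contains("Verifying your connection"))
  }

  fn main() {
      assert!(looks_like_aws_waf_interstitial(r#"<div class="interstitial-spinner"></div>"#));
      let article = format!("Verifying your connection{}", "x".repeat(20_000));
      assert!(!looks_like_aws_waf_interstitial(&article)); // size gate holds
  }
  ```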

### Changed

- **`webclaw-fetch::FetchClient` gained an optional `cloud` field** via `with_cloud(CloudClient)`. Extractors reach it through `client.cloud()` to decide whether to escalate. `webclaw-server::AppState` reads `WEBCLAW_CLOUD_API_KEY` (preferred) or falls back to `WEBCLAW_API_KEY` only when inbound auth is not configured (open mode).
- **Consolidated `CloudClient` into `webclaw-fetch`.** Previously duplicated between `webclaw-mcp/src/cloud.rs` (302 LOC) and `webclaw-cli/src/cloud.rs` (80 LOC). Single canonical home with a typed `CloudError` (`NotConfigured`, `Unauthorized`, `InsufficientPlan`, `RateLimited`, `ServerError`, `Network`, `ParseFailed`) whose `Display` messages include actionable URLs; a `From<CloudError> for String` bridge keeps pre-existing CLI / MCP call sites compiling unchanged during migration.

### Tests

- 215 unit tests passing in `webclaw-fetch` (100+ new, covering every extractor's matcher, URL parser, JSON-LD / OG fallback paths, and the cloud synthesis helper). `cargo clippy --workspace --release --no-deps` is clean.

---

## [0.4.0] — 2026-04-22

### Added

- **`webclaw bench <url>` — per-URL extraction micro-benchmark (#26).** New subcommand. Fetches a URL once, runs the same extraction pipeline as `--format llm`, and prints a small ASCII table comparing raw-HTML tokens vs. llm-output tokens, bytes, and extraction time. Pass `--json` for a single-line JSON object (stable shape, easy to append to ndjson in CI). Pass `--facts <path>` with a file in the same schema as `benchmarks/facts.json` to get a fidelity column ("4/5 facts preserved"); URLs absent from the facts file produce no fidelity row, so uncurated sites aren't shown as 0/0. v1 uses an approximate tokenizer (`chars/4` for Latin text, `chars/2` when CJK dominates) — off by ±10% vs. a real BPE tokenizer, but the signal ("the LLM pipeline dropped 93% of the raw bytes") is the point. Output clearly labels counts as `≈ tokens` so nobody confuses them with a real tiktoken run. Swapping in `tiktoken-rs` later is a one-function change in `bench.rs`. Adding this as a `clap` subcommand rather than a flag also lays the groundwork for future subcommands without breaking the existing flag-based flow — `webclaw <url> --format llm` still works exactly as before.
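  The approximation is simple enough to state in full (a sketch; bench.rs's CJK detection may differ in the exact character ranges):

  ```rust
  // chars/4 for mostly-Latin text, chars/2 when CJK characters dominate.
  fn approx_tokens(text: &str) -> usize {
      let total = text.chars().count();
      if total == 0 {
          return 0;
      }
      let cjk = text
          .chars()
          .filter(|&c| {
              ('\u{4E00}'..='\u{9FFF}').contains(&c)        // CJK Unified Ideographs
                  || ('\u{3040}'..='\u{30FF}').contains(&c) // Hiragana + Katakana
                  || ('\u{AC00}'..='\u{D7AF}').contains(&c) // Hangul syllables
          })
          .count();
      // "CJK dominates" taken here as more than half the characters.
      if cjk * 2 > total { total / 2 } else { total / 4 }
  }

  fn main() {
      assert_eq!(approx_tokens("abcdefgh"), 2); // 8 Latin chars, roughly 2 tokens
  }
  ```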

- **`webclaw-server` — new OSS binary for self-hosting a REST API (#29).** Until now, `docs/self-hosting` promised a `webclaw-server` binary that only existed in the hosted-platform repo (closed source). The Docker image shipped two binaries while the docs advertised three, which sent self-hosters into a bug loop. This release closes the gap: a new crate at `crates/webclaw-server/` builds a minimal, stateless axum server that exposes the OSS extraction pipeline over HTTP with the same JSON shapes as api.webclaw.io. Endpoints: `GET /health`, `POST /v1/{scrape,crawl,map,batch,extract,summarize,diff,brand}`. Run with `webclaw-server --port 3000 [--host 0.0.0.0] [--api-key <bearer>]` or the matching `WEBCLAW_PORT` / `WEBCLAW_HOST` / `WEBCLAW_API_KEY` env vars. Bearer auth is constant-time (via `subtle::ConstantTimeEq`); open mode (no key) is allowed on `127.0.0.1` for local development.

  What self-hosting gives you: the full extraction pipeline, Crawler, sitemap discovery, brand/diff, LLM extract/summarize (via Ollama or your own OpenAI/Anthropic key). What it does *not* give you: anti-bot bypass (Cloudflare, DataDome, WAFs), headless JS rendering, async job queues, multi-tenant auth/billing, domain-hints and proxy routing — those require the hosted backend at api.webclaw.io and are intentionally not open-source. The self-hosting docs have been updated to reflect this split honestly.
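  The auth check's constant-time property comes from `subtle::ConstantTimeEq`; this dependency-free sketch shows the idea that crate implements:

  ```rust
  // OR-accumulate the XOR of every byte pair instead of returning at the
  // first mismatch, so comparison time does not reveal how many leading
  // bytes of the presented token matched the configured key.
  fn ct_eq(a: &[u8], b: &[u8]) -> bool {
      if a.len() != b.len() {
          return false; // length is not treated as secret here
      }
      let mut diff: u8 = 0;
      for (x, y) in a.iter().zip(b) {
          diff |= x ^ y;
      }
      diff == 0
  }

  fn main() {
      assert!(ct_eq(b"secret-token", b"secret-token"));
      assert!(!ct_eq(b"secret-token", b"secret-tokeX"));
  }
  ```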

- **`crawl` endpoint runs synchronously and hard-caps at 500 pages / 20 concurrency.** No job queue, no background workers — a naive caller can't OOM the process. `batch` caps at 100 URLs / 20 concurrency for the same reason. For unbounded crawls use the hosted API.

### Changed

- **Docker image now ships three binaries**, not two. `Dockerfile` and `Dockerfile.ci` both add `webclaw-server` to `/usr/local/bin/` and `EXPOSE 3000` for documentation. The entrypoint shim is unchanged: `docker run IMAGE webclaw-server --port 3000` Just Works, and the CLI/URL pass-through from v0.3.19 is preserved.

### Docs

- Rewrote `docs/self-hosting` on the landing site to differentiate OSS (self-hosted REST) from the hosted platform. Added a capability matrix so new users don't have to read the repo to figure out why Cloudflare-protected sites still 403 when pointing at their own box.

### Fixed

- **Dead-code warning on `cargo install webclaw-mcp` (#30).** `rmcp` 1.3.x changed how the `#[tool_handler]` macro reads the `tool_router` struct field — it now goes through a derived trait impl instead of referencing the field by name, so rustc's dead-code lint no longer sees it. The field is still essential (dropping it unregisters every MCP tool), just invisible to the lint. Annotated with `#[allow(dead_code)]` and a comment explaining why. No behaviour change. Warning disappears on the next `cargo install`.

---

## [0.3.19] — 2026-04-17

### Fixed

- **Docker image can be used as a FROM base again.** v0.3.13 switched the Docker `CMD` to `ENTRYPOINT ["webclaw"]` so that `docker run IMAGE https://example.com` would pass the URL through as expected. That change trapped a different use case: downstream Dockerfiles that `FROM ghcr.io/0xmassi/webclaw` and set their own `CMD ["./setup.sh"]` — the child's `./setup.sh` became the first arg to `webclaw`, which tried to fetch it as a URL and failed with `error sending request for uri (https://./setup.sh)`. Both `Dockerfile` and `Dockerfile.ci` now use a small `docker-entrypoint.sh` shim that forwards flags (`-*`) and URLs (`http://`, `https://`) to `webclaw`, but `exec`s anything else directly. All four use cases now work: `docker run IMAGE https://example.com`, `docker run IMAGE --help`, child-image `CMD ["./setup.sh"]`, and `docker run IMAGE bash` for debugging. Default `CMD` is `["webclaw", "--help"]`.
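  The shim's routing rule reduces to one `case` statement. This is a sketch of the behavior described above (the shipped `docker-entrypoint.sh` may differ in detail); `route` here only echoes the decision where the real script would `exec`:

  ```shell
  #!/bin/sh
  # Flags and URLs go to webclaw; anything else runs as-is, so a child
  # image's CMD ["./setup.sh"] executes directly instead of being fetched
  # as a URL.
  route() {
      case "$1" in
          -*|http://*|https://*) echo "webclaw $*" ;;  # --help, URLs
          *)                     echo "$*" ;;          # bash, ./setup.sh, ...
      esac
  }

  # The real shim would `exec webclaw "$@"` / `exec "$@"` instead.
  route "$@"
  ```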

---

## [0.3.18] — 2026-04-16

### Fixed
CLAUDE.md: 26 lines changed
````diff
@@ -11,7 +11,7 @@ webclaw/
   #   + ExtractionOptions (include/exclude CSS selectors)
   #   + diff engine (change tracking)
   #   + brand extraction (DOM/CSS analysis)
-  webclaw-fetch/   # HTTP client via primp. Crawler. Sitemap discovery. Batch ops.
+  webclaw-fetch/   # HTTP client via wreq (BoringSSL). Crawler. Sitemap discovery. Batch ops.
   #   + proxy pool rotation (per-request)
   #   + PDF content-type detection
   #   + document parsing (DOCX, XLSX, CSV)
@@ -20,9 +20,11 @@ webclaw/
   webclaw-pdf/     # PDF text extraction via pdf-extract
   webclaw-mcp/     # MCP server (Model Context Protocol) for AI agents
   webclaw-cli/     # CLI binary
+  webclaw-server/  # Minimal axum REST API (self-hosting; OSS counterpart
+                   # of api.webclaw.io, without anti-bot / JS / jobs / auth)
 ```
 
-Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server).
+Three binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server), `webclaw-server` (REST API for self-hosting).
 
 ### Core Modules (`webclaw-core`)
 - `extractor.rs` — Readability-style scoring: text density, semantic tags, link density penalty
@@ -38,7 +40,7 @@ Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server).
 - `brand.rs` — Brand identity extraction from DOM structure and CSS
 
 ### Fetch Modules (`webclaw-fetch`)
-- `client.rs` — FetchClient with primp TLS impersonation
+- `client.rs` — FetchClient with wreq BoringSSL TLS impersonation; implements the public `Fetcher` trait so callers (including server adapters) can swap in alternative implementations
 - `browser.rs` — Browser profiles: Chrome (142/136/133/131), Firefox (144/135/133/128)
 - `crawler.rs` — BFS same-origin crawler with configurable depth/concurrency/delay
 - `sitemap.rs` — Sitemap discovery and parsing (sitemap.xml, robots.txt)
@@ -60,12 +62,24 @@ Two binaries: `webclaw` (CLI), `webclaw-mcp` (MCP server).
 - Works with Claude Desktop, Claude Code, and any MCP client
 - Uses `rmcp` crate (official Rust MCP SDK)
 
+### REST API Server (`webclaw-server`)
+- Axum 0.8, stateless, no database, no job queue
+- 8 POST routes + /health, JSON shapes mirror api.webclaw.io where the
+  capability exists in OSS
+- Constant-time bearer-token auth via `subtle::ConstantTimeEq` when
+  `--api-key` / `WEBCLAW_API_KEY` is set; otherwise open mode
+- Hard caps: crawl ≤ 500 pages, batch ≤ 100 URLs, 20 concurrent
+- Does NOT include: anti-bot bypass, JS rendering, async jobs,
+  multi-tenant auth, billing, proxy rotation, search/research/watch/
+  agent-scrape. Those live behind api.webclaw.io and are closed-source.
+
 ## Hard Rules
 
 - **Core has ZERO network dependencies** — takes `&str` HTML, returns structured output. Keep it WASM-compatible.
-- **primp requires `[patch.crates-io]`** for patched rustls/h2 forks at workspace level.
-- **RUSTFLAGS are set in `.cargo/config.toml`** — no need to pass manually.
-- **webclaw-llm uses plain reqwest** (NOT primp-patched). LLM APIs don't need TLS fingerprinting.
+- **webclaw-fetch uses wreq 6.x** (BoringSSL). No `[patch.crates-io]` forks needed; wreq handles TLS internally.
+- **No special RUSTFLAGS** — `.cargo/config.toml` is currently empty of build flags. Don't add any.
+- **webclaw-llm uses plain reqwest**. LLM APIs don't need TLS fingerprinting, so no wreq dep.
+- **Vertical extractors take `&dyn Fetcher`**, not `&FetchClient`. This lets the production server plug in a `ProductionFetcher` that adds domain_hints routing and antibot escalation on top of the same wreq client.
 - **qwen3 thinking tags** (`<think>`) are stripped at both provider and consumer levels.
 
 ## Build & Test
````
Cargo.lock (generated): 165 lines changed
```diff
@@ -182,6 +182,70 @@ version = "1.5.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8"
 
+[[package]]
+name = "axum"
+version = "0.8.9"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "31b698c5f9a010f6573133b09e0de5408834d0c82f8d7475a89fc1867a71cd90"
+dependencies = [
+ "axum-core",
+ "axum-macros",
+ "bytes",
+ "form_urlencoded",
+ "futures-util",
+ "http",
+ "http-body",
+ "http-body-util",
+ "hyper",
+ "hyper-util",
+ "itoa",
+ "matchit",
+ "memchr",
+ "mime",
+ "percent-encoding",
+ "pin-project-lite",
+ "serde_core",
+ "serde_json",
+ "serde_path_to_error",
+ "serde_urlencoded",
+ "sync_wrapper",
+ "tokio",
+ "tower",
+ "tower-layer",
+ "tower-service",
+ "tracing",
+]
+
+[[package]]
+name = "axum-core"
+version = "0.5.6"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "08c78f31d7b1291f7ee735c1c6780ccde7785daae9a9206026862dab7d8792d1"
+dependencies = [
+ "bytes",
+ "futures-core",
+ "http",
+ "http-body",
+ "http-body-util",
+ "mime",
+ "pin-project-lite",
+ "sync_wrapper",
+ "tower-layer",
+ "tower-service",
+ "tracing",
+]
+
+[[package]]
+name = "axum-macros"
+version = "0.5.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7aa268c23bfbbd2c4363b9cd302a4f504fb2a9dfe7e3451d66f35dd392e20aca"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "base64"
 version = "0.22.1"
@@ -1132,6 +1196,12 @@ version = "1.10.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87"
 
+[[package]]
+name = "httpdate"
+version = "1.0.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "df3b46402a9d5adb4c86a0cf463f42e19994e3ee891101b1841f30a545cb49a9"
+
 [[package]]
 name = "hyper"
 version = "1.9.0"
@@ -1145,6 +1215,7 @@ dependencies = [
  "http",
  "http-body",
  "httparse",
+ "httpdate",
  "itoa",
  "pin-project-lite",
  "smallvec",
@@ -1559,6 +1630,12 @@ dependencies = [
  "regex-automata",
 ]
 
+[[package]]
+name = "matchit"
+version = "0.8.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "47e1ffaa40ddd1f3ed91f717a33c8c0ee23fff369e3aa8772b9605cc1d22f4c3"
+
 [[package]]
 name = "md-5"
 version = "0.10.6"
@@ -1575,6 +1652,12 @@ version = "2.8.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79"
 
+[[package]]
+name = "mime"
+version = "0.3.17"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a"
+
 [[package]]
 name = "minimal-lexical"
 version = "0.2.1"
@@ -2403,6 +2486,17 @@ dependencies = [
  "zmij",
 ]
 
+[[package]]
+name = "serde_path_to_error"
+version = "0.1.20"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "10a9ff822e371bb5403e391ecd83e182e0e77ba7f6fe0160b795797109d1b457"
+dependencies = [
+ "itoa",
+ "serde",
+ "serde_core",
+]
+
 [[package]]
 name = "serde_urlencoded"
 version = "0.7.1"
@@ -2757,6 +2851,7 @@ dependencies = [
  "tokio",
  "tower-layer",
  "tower-service",
  "tracing",
 ]
 
 [[package]]
@@ -2780,6 +2875,7 @@ dependencies = [
  "tower",
  "tower-layer",
  "tower-service",
  "tracing",
 ]
 
 [[package]]
@@ -2800,6 +2896,7 @@ version = "0.1.44"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "63e71662fa4b2a2c3a26f570f037eb95bb1f85397f3cd8076caed2f026a6d100"
 dependencies = [
  "log",
  "pin-project-lite",
  "tracing-attributes",
  "tracing-core",
@@ -2870,6 +2967,26 @@ dependencies = [
  "pom",
 ]
 
+[[package]]
+name = "typed-builder"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "31aa81521b70f94402501d848ccc0ecaa8f93c8eb6999eb9747e72287757ffda"
+dependencies = [
+ "typed-builder-macro",
+]
+
+[[package]]
+name = "typed-builder-macro"
+version = "0.23.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "076a02dc54dd46795c2e9c8282ed40bcfb1e22747e955de9389a1de28190fb26"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
 [[package]]
 name = "typed-path"
 version = "0.12.3"
@@ -3102,7 +3219,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-cli"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "clap",
  "dotenvy",
@@ -3123,7 +3240,7 @@ dependencies = [
 
 [[package]]
 name = "webclaw-core"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "ego-tree",
  "once_cell",
@@ -3141,13 +3258,16 @@ dependencies = [
 
 [[package]]
 name = "webclaw-fetch"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "async-trait",
  "bytes",
  "calamine",
  "http",
  "quick-xml 0.37.5",
  "rand 0.8.5",
  "regex",
  "reqwest",
  "serde",
  "serde_json",
  "tempfile",
@@ -3158,12 +3278,13 @@ dependencies = [
  "webclaw-core",
  "webclaw-pdf",
  "wreq",
  "wreq-util",
  "zip 2.4.2",
 ]
 
 [[package]]
 name = "webclaw-llm"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "async-trait",
  "reqwest",
@@ -3176,11 +3297,10 @@ dependencies = [
 
 [[package]]
 name = "webclaw-mcp"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "dirs",
  "dotenvy",
  "reqwest",
  "rmcp",
  "schemars",
  "serde",
@@ -3197,13 +3317,34 @@ dependencies = [
 
 [[package]]
 name = "webclaw-pdf"
-version = "0.3.18"
+version = "0.5.6"
 dependencies = [
  "pdf-extract",
  "thiserror",
  "tracing",
 ]
 
+[[package]]
+name = "webclaw-server"
+version = "0.5.6"
+dependencies = [
+ "anyhow",
+ "axum",
+ "clap",
+ "serde",
+ "serde_json",
+ "subtle",
+ "thiserror",
+ "tokio",
+ "tower-http",
+ "tracing",
+ "tracing-subscriber",
+ "webclaw-core",
+ "webclaw-fetch",
+ "webclaw-llm",
+ "webclaw-pdf",
+]
+
 [[package]]
 name = "webpki-root-certs"
 version = "1.0.6"
@@ -3589,6 +3730,16 @@ dependencies = [
  "zstd",
 ]
 
+[[package]]
+name = "wreq-util"
+version = "3.0.0-rc.10"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6c6bbe24d28beb9ceb58b514bd6a613c759d3b706f768b9d2950d5d35b543c04"
+dependencies = [
+ "typed-builder",
+ "wreq",
+]
+
 [[package]]
 name = "writeable"
 version = "0.6.2"
```
Cargo.toml (workspace manifest)

```diff
@@ -3,7 +3,7 @@ resolver = "2"
 members = ["crates/*"]
 
 [workspace.package]
-version = "0.3.18"
+version = "0.5.6"
 edition = "2024"
 license = "AGPL-3.0"
 repository = "https://github.com/0xMassi/webclaw"
```
Dockerfile: 37 lines changed
@@ -1,5 +1,12 @@
 # webclaw — Multi-stage Docker build
-# Produces 2 binaries: webclaw (CLI) and webclaw-mcp (MCP server)
+# Produces 3 binaries:
+#   webclaw        — CLI (single-shot extraction, crawl, MCP-less use)
+#   webclaw-mcp    — MCP server (stdio, for AI agents)
+#   webclaw-server — minimal REST API for self-hosting (OSS, stateless)
+#
+# NOTE: this is NOT the hosted API at api.webclaw.io — the cloud service
+# adds anti-bot bypass, JS rendering, multi-tenant auth and async jobs
+# that are intentionally not open-source. See docs/self-hosting.

 # ---------------------------------------------------------------------------
 # Stage 1: Build all binaries in release mode
@@ -25,6 +32,7 @@ COPY crates/webclaw-llm/Cargo.toml crates/webclaw-llm/Cargo.toml
 COPY crates/webclaw-pdf/Cargo.toml crates/webclaw-pdf/Cargo.toml
 COPY crates/webclaw-mcp/Cargo.toml crates/webclaw-mcp/Cargo.toml
 COPY crates/webclaw-cli/Cargo.toml crates/webclaw-cli/Cargo.toml
+COPY crates/webclaw-server/Cargo.toml crates/webclaw-server/Cargo.toml

 # Copy .cargo config if present (optional build flags)
 COPY .cargo .cargo
@@ -35,7 +43,8 @@ RUN mkdir -p crates/webclaw-core/src && echo "" > crates/webclaw-core/src/lib.rs \
     && mkdir -p crates/webclaw-llm/src && echo "" > crates/webclaw-llm/src/lib.rs \
     && mkdir -p crates/webclaw-pdf/src && echo "" > crates/webclaw-pdf/src/lib.rs \
     && mkdir -p crates/webclaw-mcp/src && echo "fn main() {}" > crates/webclaw-mcp/src/main.rs \
-    && mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs
+    && mkdir -p crates/webclaw-cli/src && echo "fn main() {}" > crates/webclaw-cli/src/main.rs \
+    && mkdir -p crates/webclaw-server/src && echo "fn main() {}" > crates/webclaw-server/src/main.rs

 # Pre-build dependencies (this layer is cached until Cargo.toml/lock changes)
 RUN cargo build --release 2>/dev/null || true
@@ -54,9 +63,27 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     ca-certificates \
     && rm -rf /var/lib/apt/lists/*

-# Copy both binaries
+# Copy all three binaries
 COPY --from=builder /build/target/release/webclaw /usr/local/bin/webclaw
 COPY --from=builder /build/target/release/webclaw-mcp /usr/local/bin/webclaw-mcp
+COPY --from=builder /build/target/release/webclaw-server /usr/local/bin/webclaw-server

-# Default: run the CLI (ENTRYPOINT so args pass through)
-ENTRYPOINT ["webclaw"]
+# Default port the REST API listens on when you run `webclaw-server` inside
+# the container. Override with -e WEBCLAW_PORT=... or --port. Published only
+# as documentation; callers still need `-p 3000:3000` on `docker run`.
+EXPOSE 3000
+
+# Container default: bind all interfaces so `-p 3000:3000` works. The binary
+# itself defaults to 127.0.0.1 (safe for `cargo run` on a laptop); inside
+# Docker that would make the server unreachable, so we flip it here.
+# Override with -e WEBCLAW_HOST=127.0.0.1 if you front this with another
+# process in the same container.
+ENV WEBCLAW_HOST=0.0.0.0
+
+# Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other
+# commands directly so this image can be used as a FROM base with custom CMD.
+COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
+RUN chmod +x /usr/local/bin/docker-entrypoint.sh
+
+ENTRYPOINT ["docker-entrypoint.sh"]
+CMD ["webclaw", "--help"]
@@ -12,5 +12,20 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 ARG BINARY_DIR
 COPY ${BINARY_DIR}/webclaw /usr/local/bin/webclaw
 COPY ${BINARY_DIR}/webclaw-mcp /usr/local/bin/webclaw-mcp
+COPY ${BINARY_DIR}/webclaw-server /usr/local/bin/webclaw-server

-ENTRYPOINT ["webclaw"]
+# Default REST API port when running `webclaw-server` inside the container.
+EXPOSE 3000
+
+# Container default: bind all interfaces so `-p 3000:3000` works. The
+# binary itself defaults to 127.0.0.1; flipping here keeps the CLI safe on
+# a laptop but makes the container reachable out of the box.
+ENV WEBCLAW_HOST=0.0.0.0
+
+# Entrypoint shim: forwards webclaw args/URL to the binary, but exec's other
+# commands directly so this image can be used as a FROM base with custom CMD.
+COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
+RUN chmod +x /usr/local/bin/docker-entrypoint.sh
+
+ENTRYPOINT ["docker-entrypoint.sh"]
+CMD ["webclaw", "--help"]
@@ -1,130 +1,94 @@
 # Benchmarks

-Extraction quality and performance benchmarks comparing webclaw against popular alternatives.
+Reproducible benchmarks comparing `webclaw` against open-source and commercial
+web extraction tools. Every number here ships with the script that produced it.
+Run `./run.sh` to regenerate.

-## Quick Run
+## Headline
+
+**webclaw preserves more page content than any other tool tested, at 2.4× the
+speed of the closest competitor.**
+
+Across 18 production sites (SPAs, documentation, long-form articles, news,
+enterprise marketing), measured over 3 runs per site with OpenAI's
+`cl100k_base` tokenizer. Last run: 2026-04-17, webclaw v0.3.18.
+
+| Tool | Fidelity (facts preserved) | Token reduction vs raw HTML | Mean latency |
+|---|---:|---:|---:|
+| **webclaw `--format llm`** | **76 / 90 (84.4 %)** | 92.5 % | **0.41 s** |
+| Firecrawl API (v2, hosted) | 70 / 90 (77.8 %) | 92.4 % | 0.99 s |
+| Trafilatura 2.0 | 45 / 90 (50.0 %) | 97.8 % (by dropping content) | 0.21 s |
+
+**webclaw matches or beats both competitors on fidelity on all 18 sites.**
+
+## Why webclaw wins
+
+- **Speed.** 2.4× faster than Firecrawl's hosted API. Firecrawl defaults to
+  browser rendering for everything; webclaw's in-process TLS-fingerprinted
+  fetch plus deterministic extractor reaches comparable-or-better content
+  without that overhead.
+- **Fidelity.** Trafilatura's higher token reduction comes from dropping
+  content. On the 18 sites tested it missed 45 of 90 key facts — entire
+  customer-story sections, release dates, product names. webclaw keeps them.
+- **Deterministic.** Same URL → same output. No LLM post-processing, no
+  paraphrasing, no hallucination risk.

+## Per-site results
+
+Numbers are median of 3 runs. `raw` = raw fetched HTML token count.
+`facts` = hand-curated visible facts preserved out of 5 per site.
+
+| Site | raw HTML | webclaw | Firecrawl | Trafilatura | wc facts | fc facts | tr facts |
+|---|---:|---:|---:|---:|:---:|:---:|:---:|
+| openai.com | 170 K | 1,238 | 3,139 | 0 | **3/5** | 2/5 | 0/5 |
+| vercel.com | 380 K | 1,076 | 4,029 | 585 | **3/5** | 3/5 | 3/5 |
+| anthropic.com | 103 K | 672 | 560 | 96 | **5/5** | 5/5 | 4/5 |
+| notion.com | 109 K | 13,416 | 5,261 | 91 | **5/5** | 5/5 | 2/5 |
+| stripe.com | 243 K | 81,974 | 8,922 | 2,418 | **5/5** | 5/5 | 0/5 |
+| tavily.com | 30 K | 1,361 | 1,969 | 182 | **5/5** | 4/5 | 3/5 |
+| shopify.com | 184 K | 1,939 | 5,384 | 595 | **3/5** | 3/5 | 3/5 |
+| docs.python.org | 5 K | 689 | 1,623 | 347 | **4/5** | 4/5 | 4/5 |
+| react.dev | 107 K | 3,332 | 4,959 | 763 | **5/5** | 5/5 | 3/5 |
+| tailwindcss.com/docs/installation | 113 K | 779 | 813 | 430 | **4/5** | 4/5 | 2/5 |
+| nextjs.org/docs | 228 K | 968 | 885 | 631 | **4/5** | 4/5 | 4/5 |
+| github.com | 234 K | 1,438 | 3,058 | 486 | **5/5** | 4/5 | 3/5 |
+| en.wikipedia.org/wiki/Rust | 189 K | 47,823 | 59,326 | 37,427 | **5/5** | 5/5 | 5/5 |
+| simonwillison.net/…/latent-reasoning | 3 K | 724 | 525 | 0 | **4/5** | 2/5 | 0/5 |
+| paulgraham.com/essays.html | 2 K | 169 | 295 | 0 | **2/5** | 1/5 | 0/5 |
+| techcrunch.com | 143 K | 7,265 | 11,408 | 397 | **5/5** | 5/5 | 5/5 |
+| databricks.com | 274 K | 2,001 | 5,471 | 311 | **4/5** | 4/5 | 4/5 |
+| hashicorp.com | 109 K | 1,501 | 4,289 | 0 | **5/5** | 5/5 | 0/5 |

+## Reproducing this benchmark
+
 ```bash
-# Run all benchmarks
-cargo run --release -p webclaw-bench
-
-# Run specific benchmark
-cargo run --release -p webclaw-bench -- --filter quality
-cargo run --release -p webclaw-bench -- --filter speed
+cd benchmarks/
+./run.sh
 ```

-## Extraction Quality
+Requirements:
+- Python 3.9+
+- `pip install tiktoken trafilatura firecrawl-py`
+- `webclaw` release binary at `../target/release/webclaw` (or set `$WEBCLAW`)
+- Firecrawl API key (free tier: 500 credits/month, enough for many runs) —
+  export as `FIRECRAWL_API_KEY`. If omitted, the benchmark runs with webclaw
+  and Trafilatura only.

-Tested against 50 diverse web pages (news articles, documentation, blogs, SPAs, e-commerce).
-Each page scored on: content completeness, noise removal, link preservation, metadata accuracy.
+One run of the full suite burns ~60 Firecrawl credits (18 sites × 3 runs;
+each Firecrawl scrape costs 1 credit).

-| Extractor | Accuracy | Noise Removal | Links | Metadata | Avg Score |
-|-----------|----------|---------------|-------|----------|-----------|
-| **webclaw** | **94.2%** | **96.1%** | **98.3%** | **91.7%** | **95.1%** |
-| mozilla/readability | 87.3% | 89.4% | 85.1% | 72.3% | 83.5% |
-| trafilatura | 82.1% | 91.2% | 68.4% | 80.5% | 80.6% |
-| newspaper3k | 71.4% | 76.8% | 52.3% | 65.2% | 66.4% |
+## Methodology

-### Scoring Methodology
+See [methodology.md](methodology.md) for:
+- Tokenizer rationale (`cl100k_base` → covers GPT-4 / GPT-3.5 /
+  `text-embedding-3-*`)
+- Fact selection procedure and how to propose additions
+- Why median of 3 runs (CDN / cache / network noise)
+- Raw data schema (`results/*.json`)
+- Notes on site churn (news aggregators, release pages)

-- **Accuracy**: Percentage of main content extracted vs human-annotated ground truth
-- **Noise Removal**: Percentage of navigation, ads, footers, and boilerplate correctly excluded
-- **Links**: Percentage of meaningful content links preserved with correct text and href
-- **Metadata**: Correct extraction of title, author, date, description, and language
+## Raw data

-### Why webclaw scores higher
-
-1. **Multi-signal scoring**: Combines text density, semantic HTML tags, link density penalty, and DOM depth analysis
-2. **Data island extraction**: Catches React/Next.js JSON payloads that DOM-only extractors miss
-3. **Domain-specific heuristics**: Auto-detects site type (news, docs, e-commerce, social) and adapts strategy
-4. **Noise filter**: Shared filter using ARIA roles, class/ID patterns, and structural analysis (Tailwind-safe)

-## Extraction Speed
-
-Single-page extraction time (parsing + extraction, no network). Measured on M4 Pro, averaged over 1000 runs.
-
-| Page Size | webclaw | readability | trafilatura |
-|-----------|---------|-------------|-------------|
-| Small (10KB) | **0.8ms** | 2.1ms | 4.3ms |
-| Medium (100KB) | **3.2ms** | 8.7ms | 18.4ms |
-| Large (500KB) | **12.1ms** | 34.2ms | 72.8ms |
-| Huge (2MB) | **41.3ms** | 112ms | 284ms |
-
-### Why webclaw is faster
-
-1. **Rust**: No garbage collection, zero-cost abstractions, SIMD-optimized string operations
-2. **Single-pass scoring**: Content scoring happens during DOM traversal, not as a separate pass
-3. **Lazy allocation**: Markdown conversion streams output instead of building intermediate structures
-
-## LLM Token Efficiency
-
-Tokens used when feeding extraction output to Claude/GPT. Lower is better (same information, fewer tokens = cheaper).
-
-| Format | Tokens (avg) | vs Raw HTML |
-|--------|-------------|-------------|
-| Raw HTML | 4,820 | baseline |
-| webclaw markdown | 1,840 | **-62%** |
-| webclaw text | 1,620 | **-66%** |
-| **webclaw llm** | **1,590** | **-67%** |
-| readability markdown | 2,340 | -51% |
-| trafilatura text | 2,180 | -55% |
-
-The `llm` format applies a 9-step optimization pipeline: image strip, emphasis strip, link dedup, stat merge, whitespace collapse, and more.
-
-## Crawl Performance
-
-Crawling speed with concurrent extraction. Target: example documentation site (~200 pages).
-
-| Concurrency | webclaw | Crawl4AI | Scrapy |
-|-------------|---------|----------|--------|
-| 1 | 2.1 pages/s | 1.4 pages/s | 1.8 pages/s |
-| 5 | **9.8 pages/s** | 5.2 pages/s | 7.1 pages/s |
-| 10 | **18.4 pages/s** | 8.7 pages/s | 12.3 pages/s |
-| 20 | **32.1 pages/s** | 14.2 pages/s | 21.8 pages/s |
-
-## Bot Protection Bypass
-
-Success rate against common anti-bot systems (100 attempts each, via Cloud API with antibot sidecar).
-
-| Protection | webclaw | Firecrawl | Bright Data |
-|------------|---------|-----------|-------------|
-| Cloudflare Turnstile | **97%** | 62% | 94% |
-| DataDome | **91%** | 41% | 88% |
-| AWS WAF | **95%** | 78% | 92% |
-| hCaptcha | **89%** | 35% | 85% |
-| No protection | 100% | 100% | 100% |
-
-Note: Bot protection bypass requires the Cloud API with antibot sidecar. The open-source CLI detects protection and suggests using `--cloud` mode.

-## Running Benchmarks Yourself
-
-```bash
-# Clone the repo
-git clone https://github.com/0xMassi/webclaw.git
-cd webclaw
-
-# Run quality benchmarks (downloads test pages on first run)
-cargo run --release -p webclaw-bench -- --filter quality
-
-# Run speed benchmarks
-cargo run --release -p webclaw-bench -- --filter speed
-
-# Run token efficiency benchmarks (requires tiktoken)
-cargo run --release -p webclaw-bench -- --filter tokens
-
-# Full benchmark suite with HTML report
-cargo run --release -p webclaw-bench -- --report html
-```
-
-## Reproducing Results
-
-All benchmark test pages are cached in `benchmarks/fixtures/` after first download. The fixture set includes:
-
-- 10 news articles (NYT, BBC, Reuters, TechCrunch, etc.)
-- 10 documentation pages (Rust docs, MDN, React docs, etc.)
-- 10 blog posts (personal blogs, Medium, Substack)
-- 10 e-commerce pages (Amazon, Shopify stores)
-- 5 SPA/React pages (Next.js, Remix apps)
-- 5 edge cases (minimal HTML, huge pages, heavy JavaScript)
-
-Ground truth annotations are in `benchmarks/ground-truth/` as JSON files with manually verified content boundaries.
+Per-run results are committed as JSON at `results/YYYY-MM-DD.json` so the
+history of measurements is auditable. Diff two runs to see regressions or
+improvements across webclaw versions.
23 benchmarks/facts.json Normal file
@@ -0,0 +1,23 @@
{
  "_comment": "Hand-curated 'visible facts' per site. Inspected from live pages on 2026-04-17. PRs welcome to add sites or adjust facts — keep facts specific (customer names, headline stats, product names), not generic words.",
  "facts": {
    "https://openai.com": ["ChatGPT", "Sora", "API", "Enterprise", "research"],
    "https://vercel.com": ["Next.js", "Hobby", "Pro", "Enterprise", "deploy"],
    "https://anthropic.com": ["Opus", "Claude", "Glasswing", "Perseverance", "NASA"],
    "https://www.notion.com": ["agents", "Forbes", "Figma", "Ramp", "Cursor"],
    "https://stripe.com": ["Hertz", "URBN", "Instacart", "99.999", "1.9"],
    "https://tavily.com": ["search", "extract", "crawl", "research", "developers"],
    "https://www.shopify.com": ["Plus", "merchants", "retail", "brands", "checkout"],
    "https://docs.python.org/3/": ["tutorial", "library", "reference", "setup", "distribution"],
    "https://react.dev": ["Components", "JSX", "Hooks", "Learn", "Reference"],
    "https://tailwindcss.com/docs/installation": ["Vite", "PostCSS", "CLI", "install", "Next.js"],
    "https://nextjs.org/docs": ["App Router", "Pages Router", "getting-started", "deploying", "Server"],
    "https://github.com": ["Copilot", "Actions", "millions", "developers", "Enterprise"],
    "https://en.wikipedia.org/wiki/Rust_(programming_language)": ["Graydon", "Mozilla", "borrow", "Cargo", "2015"],
    "https://simonwillison.net/2026/Mar/15/latent-reasoning/": ["latent", "reasoning", "Willison", "model", "Simon"],
    "https://paulgraham.com/essays.html": ["Graham", "essay", "startup", "Lisp", "founders"],
    "https://techcrunch.com": ["TechCrunch", "startup", "news", "events", "latest"],
    "https://www.databricks.com": ["Lakehouse", "platform", "data", "MLflow", "AI"],
    "https://www.hashicorp.com": ["Terraform", "Vault", "Consul", "infrastructure", "enterprise"]
  }
}
142 benchmarks/methodology.md Normal file
@@ -0,0 +1,142 @@
# Methodology

## What is measured

Three metrics per site:

1. **Token efficiency** — tokens of the extractor's output vs tokens of the
   raw fetched HTML. Lower tokens = cheaper to feed into an LLM. But lower
   tokens *only matter if the content is preserved*, so tokens are always
   reported alongside fidelity.
2. **Fidelity** — how many hand-curated "visible facts" the extractor
   preserved. Per site we list 5 strings that any reader would say are
   meaningfully on the page (customer names, headline stats, product names,
   release information). Matched case-insensitively, with word boundaries
   where the fact is a single alphanumeric token (`API` does not match
   `apiece`).
3. **Latency** — wall-clock time from URL submission to markdown output.
   Includes fetch + extraction. Network-dependent, so reported as the
   median of 3 runs.
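The matching rule in item 2 can be sketched in Python. The function name and exact regex below are illustrative, not the benchmark's actual implementation (which lives in `scripts/bench.py`):

```python
import re

def fact_preserved(fact: str, output: str) -> bool:
    """Case-insensitive containment check. Single alphanumeric tokens get
    word boundaries so `API` does not match `apiece`; punctuated or
    multi-word facts (e.g. "99.999") are matched as plain substrings."""
    if re.fullmatch(r"[A-Za-z0-9]+", fact):
        pattern = r"\b" + re.escape(fact) + r"\b"
    else:
        pattern = re.escape(fact)
    return re.search(pattern, output, re.IGNORECASE) is not None
```

A site's fidelity score is then just the count of its 5 facts for which this check returns true.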
## Tokenizer

`cl100k_base` via OpenAI's `tiktoken` library. This is the encoding used by
GPT-4, GPT-3.5-turbo, and `text-embedding-3-*` — the models most users plug
extracted web content into. Pinned in `scripts/bench.py`.
## Tool versions

Listed at the top of each run's `results/YYYY-MM-DD.json` file. The run
published at launch used:

- `webclaw 0.3.18` (release build, default options, `--format llm`)
- `trafilatura 2.0.0` (`extract(html, output_format="markdown",
  include_links=True, include_tables=True, favor_recall=True)`)
- `firecrawl-py 4.x` against Firecrawl's hosted `v2` API
  (`scrape(url, formats=["markdown"])`)
## Fact selection

Facts for each site were chosen by manual inspection of the live page in a
browser on 2026-04-17. Selection criteria:

- must be **visibly present** (not in `<head>`, `<script>`, or hidden
  sections)
- must be **specific** — customer names, headline stats, product names,
  release dates. Not generic words like "the", "platform", "we".
- must be **stable across multiple loads** (no AB-tested copy, no random
  customer rotations)
- 5 facts per site, documented in `facts.json`

Facts are committed as data, not code, so **new facts can be proposed via
pull request**. Any addition runs against all three tools automatically.

Known limitation: sites change. News aggregators, release pages, and
blog indexes drift. If a fact disappears because the page changed (not
because the extractor dropped it), we expect all three tools to miss it
together, which makes it visible as "all tools tied on this site" in the
per-site breakdown. Facts on churning pages are refreshed on each published
run.
## Why median of 3 runs

Single-run numbers are noisy:

- **Latency** varies ±30% from run to run due to network jitter, CDN cache
  state, and the remote server's own load.
- **Raw-HTML token count** can vary if the server renders different content
  per request (A/B tests, geo-IP, session state).
- **Tool-specific flakiness** exists at the long tail. The occasional
  Firecrawl 502 or trafilatura fetch failure would otherwise distort a
  single-run benchmark.

We run each site 3 times and take the median per metric. The published
number is the 50th percentile; the full run data (min / median / max)
is preserved in `results/YYYY-MM-DD.json`.
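The per-site aggregation can be sketched as a small helper. The run dicts and the helper name are illustrative; the field names mirror the `results/*.json` schema:

```python
from statistics import median

def aggregate_site(runs: list[dict]) -> dict:
    """Collapse the 3 per-run measurements for one site into the
    published per-site medians."""
    return {
        "tokens_med": median(r["tokens"] for r in runs),
        "facts_med": median(r["facts"] for r in runs),
        "seconds_med": median(r["seconds"] for r in runs),
    }
```

Note that the median is taken independently per metric, so `tokens_med` and `seconds_med` may come from different runs of the same site.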
## Fair comparison notes

- **Each tool fetches via its own preferred path.** webclaw uses its
  in-process primp HTTP client. Trafilatura uses `requests`. Firecrawl
  fetches via its hosted infrastructure (Chrome CDP when needed). This is
  the apples-to-apples developer-experience comparison: what you get when
  you call each tool with a URL. The "vs raw HTML" column uses webclaw's
  `--raw-html` as the baseline denominator.
- **Firecrawl's default engine picker** runs in "auto" mode with browser
  rendering for sites it detects need it. No flags tuned, no URLs
  cherry-picked.
- **No retries**, no fallbacks, no post-processing on top of any tool's
  output. If a tool returns `""` or errors, that is the measured result
  for that run. The median of 3 runs absorbs transient errors; persistent
  extraction failures (e.g. trafilatura on `simonwillison.net`, which
  returned `""` on all 3 runs) show up as 0 tokens and 0 facts.
## Raw data schema

`results/YYYY-MM-DD.json`:

```json
{
  "timestamp": "2026-04-17 ...",
  "webclaw_version": "0.3.18",
  "trafilatura_version": "2.0.0",
  "tokenizer": "cl100k_base",
  "runs_per_site": 3,
  "site_count": 18,
  "total_facts": 90,
  "aggregates": {
    "webclaw": { "reduction_mean": 92.5, "fidelity_pct": 84.4, ... },
    "trafilatura": { "reduction_mean": 97.8, "fidelity_pct": 50.0, ... },
    "firecrawl": { "reduction_mean": 92.4, "fidelity_pct": 77.8, ... }
  },
  "per_site": [
    {
      "url": "https://openai.com",
      "facts_count": 5,
      "raw_tokens": 170508,
      "webclaw": { "tokens_med": 1238, "facts_med": 3, "seconds_med": 0.49 },
      "trafilatura": { "tokens_med": 0, "facts_med": 0, "seconds_med": 0.17 },
      "firecrawl": { "tokens_med": 3139, "facts_med": 2, "seconds_med": 1.08 }
    },
    ...
  ]
}
```
|
||||
|
||||
These measurements are intentionally out of scope for this initial
|
||||
benchmark. Each deserves its own harness and its own run.
|
||||
|
||||
- **n-gram content overlap** — v2 metric to replace curated-fact matching.
|
||||
Measure: fraction of trigrams from the visually-rendered page text that
|
||||
appear in the extractor's output. Harder to curate, easier to scale.
|
||||
- **Competitors besides trafilatura / firecrawl** — Mozilla Readability,
|
||||
Newspaper3k, Crawl4AI, Diffbot, Jina Reader. Require either JS ports or
|
||||
wrapper subprocess runners. PRs welcome.
|
||||
- **Anti-bot / protected sites** — Cloudflare Turnstile, DataDome, AWS
|
||||
WAF, hCaptcha. These require the Webclaw Cloud API with the antibot
|
||||
sidecar, not the open-source CLI, and will be published separately on
|
||||
the Webclaw landing page once the testing harness there is public.
|
||||
- **Crawl throughput** — pages-per-second under concurrent load. Different
|
||||
axis from single-page extraction; lives in its own benchmark.
|
||||
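The proposed n-gram metric could look roughly like this. Whitespace tokenisation and the function name are assumptions, since the v2 metric is not yet specified:

```python
def trigram_overlap(rendered_text: str, extracted: str) -> float:
    """Fraction of word trigrams from the rendered page text that also
    appear in the extractor's output (1.0 = everything preserved)."""
    def trigrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

    page = trigrams(rendered_text)
    if not page:
        return 0.0
    return len(page & trigrams(extracted)) / len(page)
```

Unlike the curated-fact score, this needs no per-site curation, but it does need the visually rendered text of each page as a reference input.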
397 benchmarks/results/2026-04-17.json Normal file
@@ -0,0 +1,397 @@
{
|
||||
"timestamp": "2026-04-17 14:28:42",
|
||||
"webclaw_version": "0.3.18",
|
||||
"trafilatura_version": "2.0.0",
|
||||
"tokenizer": "cl100k_base",
|
||||
"runs_per_site": 3,
|
||||
"site_count": 18,
|
||||
"total_facts": 90,
|
||||
"aggregates": {
|
||||
"webclaw": {
|
||||
"reduction_mean": 92.5,
|
||||
"reduction_median": 97.8,
|
||||
"facts_preserved": 76,
|
||||
"total_facts": 90,
|
||||
"fidelity_pct": 84.4,
|
||||
"latency_mean": 0.41
|
||||
},
|
||||
"trafilatura": {
|
||||
"reduction_mean": 97.8,
|
||||
"reduction_median": 99.7,
|
||||
"facts_preserved": 45,
|
||||
"total_facts": 90,
|
||||
"fidelity_pct": 50.0,
|
||||
"latency_mean": 0.2
|
||||
},
|
||||
"firecrawl": {
|
||||
"reduction_mean": 92.4,
|
||||
"reduction_median": 96.2,
|
||||
"facts_preserved": 70,
|
||||
"total_facts": 90,
|
||||
"fidelity_pct": 77.8,
|
||||
"latency_mean": 0.99
|
||||
}
|
||||
},
|
||||
"per_site": [
|
||||
{
|
||||
"url": "https://openai.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 170510,
|
||||
"webclaw": {
|
||||
"tokens_med": 1238,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.49
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 0,
|
||||
"facts_med": 0,
|
||||
"seconds_med": 0.12
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 3139,
|
||||
"facts_med": 2,
|
||||
"seconds_med": 1.14
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://vercel.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 380172,
|
||||
"webclaw": {
|
||||
"tokens_med": 1076,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.31
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 585,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.23
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 4029,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.99
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://anthropic.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 102911,
|
||||
"webclaw": {
|
||||
"tokens_med": 672,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.31
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 96,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.21
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 560,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.81
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://www.notion.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 109312,
|
||||
"webclaw": {
|
||||
"tokens_med": 13416,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.93
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 91,
|
||||
"facts_med": 2,
|
||||
"seconds_med": 0.65
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 5261,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.99
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://stripe.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 243465,
|
||||
"webclaw": {
|
||||
"tokens_med": 81974,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.71
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 2418,
|
||||
"facts_med": 0,
|
||||
"seconds_med": 0.39
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 8922,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 1.04
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://tavily.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 29964,
|
||||
"webclaw": {
|
||||
"tokens_med": 1361,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.33
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 182,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.18
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 1969,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.75
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://www.shopify.com",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 183738,
|
||||
"webclaw": {
|
||||
"tokens_med": 1939,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.29
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 595,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.22
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 5384,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.98
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://docs.python.org/3/",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 5275,
|
||||
"webclaw": {
|
||||
"tokens_med": 689,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.12
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 347,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.04
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 1623,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.79
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://react.dev",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 107406,
|
||||
"webclaw": {
|
||||
"tokens_med": 3332,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.23
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 763,
|
||||
"facts_med": 3,
|
||||
"seconds_med": 0.17
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 4959,
|
||||
"facts_med": 5,
|
||||
"seconds_med": 0.92
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://tailwindcss.com/docs/installation",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 113258,
|
||||
"webclaw": {
|
||||
"tokens_med": 779,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.27
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 430,
|
||||
"facts_med": 2,
|
||||
"seconds_med": 0.2
|
||||
},
|
||||
"firecrawl": {
|
||||
"tokens_med": 813,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 1.02
|
||||
}
|
||||
},
|
||||
{
|
||||
"url": "https://nextjs.org/docs",
|
||||
"facts_count": 5,
|
||||
"raw_tokens": 228196,
|
||||
"webclaw": {
|
||||
"tokens_med": 968,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.24
|
||||
},
|
||||
"trafilatura": {
|
||||
"tokens_med": 631,
|
||||
"facts_med": 4,
|
||||
"seconds_med": 0.17
|
||||
},
|
||||
"firecrawl": {
"tokens_med": 885,
"facts_med": 4,
"seconds_med": 0.88
}
},
{
"url": "https://github.com",
"facts_count": 5,
"raw_tokens": 234232,
"webclaw": {
"tokens_med": 1438,
"facts_med": 5,
"seconds_med": 0.33
},
"trafilatura": {
"tokens_med": 486,
"facts_med": 3,
"seconds_med": 0.09
},
"firecrawl": {
"tokens_med": 3058,
"facts_med": 4,
"seconds_med": 0.92
}
},
{
"url": "https://en.wikipedia.org/wiki/Rust_(programming_language)",
"facts_count": 5,
"raw_tokens": 189406,
"webclaw": {
"tokens_med": 47823,
"facts_med": 5,
"seconds_med": 0.36
},
"trafilatura": {
"tokens_med": 37427,
"facts_med": 5,
"seconds_med": 0.28
},
"firecrawl": {
"tokens_med": 59326,
"facts_med": 5,
"seconds_med": 1.49
}
},
{
"url": "https://simonwillison.net/2026/Mar/15/latent-reasoning/",
"facts_count": 5,
"raw_tokens": 3212,
"webclaw": {
"tokens_med": 724,
"facts_med": 4,
"seconds_med": 0.12
},
"trafilatura": {
"tokens_med": 0,
"facts_med": 0,
"seconds_med": 0.03
},
"firecrawl": {
"tokens_med": 525,
"facts_med": 2,
"seconds_med": 0.89
}
},
{
"url": "https://paulgraham.com/essays.html",
"facts_count": 5,
"raw_tokens": 1786,
"webclaw": {
"tokens_med": 169,
"facts_med": 2,
"seconds_med": 0.9
},
"trafilatura": {
"tokens_med": 0,
"facts_med": 0,
"seconds_med": 0.22
},
"firecrawl": {
"tokens_med": 295,
"facts_med": 1,
"seconds_med": 0.71
}
},
{
"url": "https://techcrunch.com",
"facts_count": 5,
"raw_tokens": 143309,
"webclaw": {
"tokens_med": 7265,
"facts_med": 5,
"seconds_med": 0.25
},
"trafilatura": {
"tokens_med": 397,
"facts_med": 5,
"seconds_med": 0.2
},
"firecrawl": {
"tokens_med": 11408,
"facts_med": 5,
"seconds_med": 1.21
}
},
{
"url": "https://www.databricks.com",
"facts_count": 5,
"raw_tokens": 274051,
"webclaw": {
"tokens_med": 2001,
"facts_med": 4,
"seconds_med": 0.31
},
"trafilatura": {
"tokens_med": 311,
"facts_med": 4,
"seconds_med": 0.2
},
"firecrawl": {
"tokens_med": 5471,
"facts_med": 4,
"seconds_med": 1.34
}
},
{
"url": "https://www.hashicorp.com",
"facts_count": 5,
"raw_tokens": 108510,
"webclaw": {
"tokens_med": 1501,
"facts_med": 5,
"seconds_med": 0.91
},
"trafilatura": {
"tokens_med": 0,
"facts_med": 0,
"seconds_med": 0.03
},
"firecrawl": {
"tokens_med": 4289,
"facts_med": 5,
"seconds_med": 0.91
}
}
]
}
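The per-site numbers above feed the aggregate "reduction" columns via one formula: the share of raw-HTML tokens the tool removed, `(raw_tokens - tool_tokens) / raw_tokens * 100`. A minimal sketch, using the github.com medians from the results above, shows how the percentages come out:

```python
def reduction_pct(raw_tokens: int, tool_tokens: int) -> float:
    """Percentage of raw-HTML tokens the tool removed."""
    return (raw_tokens - tool_tokens) / raw_tokens * 100

# github.com medians copied from the results file above
site = {"raw_tokens": 234232, "webclaw": 1438, "trafilatura": 486, "firecrawl": 3058}
for tool in ("webclaw", "trafilatura", "firecrawl"):
    print(f"{tool}: {reduction_pct(site['raw_tokens'], site[tool]):.1f}%")
```

Note the harness only averages this over sites where the tool returned non-empty output, so a tool that extracts nothing (tokens_med 0) is excluded from the reduction mean rather than credited with 100%.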
27  benchmarks/run.sh  Executable file
@@ -0,0 +1,27 @@
#!/usr/bin/env bash
# Reproduce the webclaw benchmark.
# Requires: python3, tiktoken, trafilatura. Optional: firecrawl-py + FIRECRAWL_API_KEY.

set -euo pipefail
cd "$(dirname "$0")"

# Build webclaw if not present
if [ ! -x "../target/release/webclaw" ]; then
    echo "→ building webclaw..."
    (cd .. && cargo build --release)
fi

# Install python deps if missing
missing=""
python3 -c "import tiktoken" 2>/dev/null || missing+=" tiktoken"
python3 -c "import trafilatura" 2>/dev/null || missing+=" trafilatura"
if [ -n "${FIRECRAWL_API_KEY:-}" ]; then
    python3 -c "import firecrawl" 2>/dev/null || missing+=" firecrawl-py"
fi
if [ -n "$missing" ]; then
    echo "→ installing python deps:$missing"
    python3 -m pip install --quiet $missing
fi

# Run
python3 scripts/bench.py
232  benchmarks/scripts/bench.py  Executable file
@@ -0,0 +1,232 @@
#!/usr/bin/env python3
"""
webclaw benchmark — webclaw vs trafilatura vs firecrawl.

Produces results/YYYY-MM-DD.json matching the schema in methodology.md.
Sites and facts come from ../sites.txt and ../facts.json.
Tokenizer: cl100k_base (GPT-4 / GPT-3.5 / text-embedding-3-*).

Usage:
    FIRECRAWL_API_KEY=fc-... python3 bench.py
    python3 bench.py              # runs webclaw + trafilatura only

Optional env:
    WEBCLAW          path to webclaw release binary (default: ../../target/release/webclaw)
    RUNS             runs per site (default: 3)
    WEBCLAW_TIMEOUT  seconds (default: 30)
"""
from __future__ import annotations
import json, os, re, statistics, subprocess, sys, time
from pathlib import Path

HERE = Path(__file__).resolve().parent
ROOT = HERE.parent          # benchmarks/
REPO_ROOT = ROOT.parent     # core/

WEBCLAW = os.environ.get("WEBCLAW", str(REPO_ROOT / "target" / "release" / "webclaw"))
RUNS = int(os.environ.get("RUNS", "3"))
WC_TIMEOUT = int(os.environ.get("WEBCLAW_TIMEOUT", "30"))

try:
    import tiktoken
    import trafilatura
except ImportError as e:
    sys.exit(f"missing dep: {e}. run: pip install tiktoken trafilatura firecrawl-py")

ENC = tiktoken.get_encoding("cl100k_base")

FC_KEY = os.environ.get("FIRECRAWL_API_KEY")
FC = None
if FC_KEY:
    try:
        from firecrawl import Firecrawl
        FC = Firecrawl(api_key=FC_KEY)
    except ImportError:
        print("firecrawl-py not installed; skipping firecrawl column", file=sys.stderr)


def load_sites() -> list[str]:
    path = ROOT / "sites.txt"
    out = []
    for line in path.read_text().splitlines():
        s = line.split("#", 1)[0].strip()
        if s:
            out.append(s)
    return out


def load_facts() -> dict[str, list[str]]:
    return json.loads((ROOT / "facts.json").read_text())["facts"]


def run_webclaw_llm(url: str) -> tuple[str, float]:
    t0 = time.time()
    r = subprocess.run(
        [WEBCLAW, url, "-f", "llm", "-t", str(WC_TIMEOUT)],
        capture_output=True, text=True, timeout=WC_TIMEOUT + 15,
    )
    return r.stdout or "", time.time() - t0


def run_webclaw_raw(url: str) -> str:
    r = subprocess.run(
        [WEBCLAW, url, "--raw-html", "-t", str(WC_TIMEOUT)],
        capture_output=True, text=True, timeout=WC_TIMEOUT + 15,
    )
    return r.stdout or ""


def run_trafilatura(url: str) -> tuple[str, float]:
    t0 = time.time()
    try:
        html = trafilatura.fetch_url(url)
        out = ""
        if html:
            out = trafilatura.extract(
                html, output_format="markdown",
                include_links=True, include_tables=True, favor_recall=True,
            ) or ""
    except Exception:
        out = ""
    return out, time.time() - t0


def run_firecrawl(url: str) -> tuple[str, float]:
    if not FC:
        return "", 0.0
    t0 = time.time()
    try:
        r = FC.scrape(url, formats=["markdown"])
        return (r.markdown or ""), time.time() - t0
    except Exception:
        return "", time.time() - t0


def tok(s: str) -> int:
    return len(ENC.encode(s, disallowed_special=())) if s else 0


_WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def hit_count(text: str, facts: list[str]) -> int:
    """Case-insensitive; word-boundary for single-token alphanumeric facts,
    substring for multi-word or non-alpha facts (like '99.999')."""
    if not text:
        return 0
    low = text.lower()
    count = 0
    for f in facts:
        f_low = f.lower()
        if " " in f or not f.isalpha():
            if f_low in low:
                count += 1
        else:
            if re.search(r"\b" + re.escape(f_low) + r"\b", low):
                count += 1
    return count


def main() -> int:
    sites = load_sites()
    facts_by_url = load_facts()
    print(f"running {len(sites)} sites × {3 if FC else 2} tools × {RUNS} runs")
    if not FC:
        print("  (no FIRECRAWL_API_KEY — skipping firecrawl column)")
    print()

    per_site = []
    for i, url in enumerate(sites, 1):
        facts = facts_by_url.get(url, [])
        if not facts:
            print(f"[{i}/{len(sites)}] {url} SKIPPED — no facts in facts.json")
            continue
        print(f"[{i}/{len(sites)}] {url}")
        raw_t = tok(run_webclaw_raw(url))

        def run_one(fn):
            out, seconds = fn(url)
            return {"tokens": tok(out), "facts": hit_count(out, facts), "seconds": seconds}

        runs = {"webclaw": [], "trafilatura": [], "firecrawl": []}
        for _ in range(RUNS):
            runs["webclaw"].append(run_one(run_webclaw_llm))
            runs["trafilatura"].append(run_one(run_trafilatura))
            if FC:
                runs["firecrawl"].append(run_one(run_firecrawl))
            else:
                runs["firecrawl"].append({"tokens": 0, "facts": 0, "seconds": 0.0})

        def med(tool, key):
            return statistics.median(r[key] for r in runs[tool])

        def med_ints(tool):
            return {
                "tokens_med": int(med(tool, "tokens")),
                "facts_med": int(med(tool, "facts")),
                "seconds_med": round(med(tool, "seconds"), 2),
            }

        per_site.append({
            "url": url,
            "facts_count": len(facts),
            "raw_tokens": raw_t,
            "webclaw": med_ints("webclaw"),
            "trafilatura": med_ints("trafilatura"),
            "firecrawl": med_ints("firecrawl"),
        })
        last = per_site[-1]
        print(f"  raw={raw_t} wc={last['webclaw']['tokens_med']}/{last['webclaw']['facts_med']}"
              f" tr={last['trafilatura']['tokens_med']}/{last['trafilatura']['facts_med']}"
              f" fc={last['firecrawl']['tokens_med']}/{last['firecrawl']['facts_med']}")

    # aggregates
    total_facts = sum(r["facts_count"] for r in per_site)

    def agg(tool):
        red_vals = [
            (r["raw_tokens"] - r[tool]["tokens_med"]) / r["raw_tokens"] * 100
            for r in per_site
            if r["raw_tokens"] > 0 and r[tool]["tokens_med"] > 0
        ]
        return {
            "reduction_mean": round(statistics.mean(red_vals), 1) if red_vals else 0.0,
            "reduction_median": round(statistics.median(red_vals), 1) if red_vals else 0.0,
            "facts_preserved": sum(r[tool]["facts_med"] for r in per_site),
            "total_facts": total_facts,
            "fidelity_pct": round(sum(r[tool]["facts_med"] for r in per_site) / total_facts * 100, 1) if total_facts else 0,
            "latency_mean": round(statistics.mean(r[tool]["seconds_med"] for r in per_site), 2),
        }

    result = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "webclaw_version": subprocess.check_output([WEBCLAW, "--version"], text=True).strip().split()[-1],
        "trafilatura_version": trafilatura.__version__,
        "firecrawl_enabled": FC is not None,
        "tokenizer": "cl100k_base",
        "runs_per_site": RUNS,
        "site_count": len(per_site),
        "total_facts": total_facts,
        "aggregates": {t: agg(t) for t in ["webclaw", "trafilatura", "firecrawl"]},
        "per_site": per_site,
    }

    out_path = ROOT / "results" / f"{time.strftime('%Y-%m-%d')}.json"
    out_path.parent.mkdir(exist_ok=True)
    out_path.write_text(json.dumps(result, indent=2))

    print()
    print("=" * 70)
    print(f"{len(per_site)} sites, {total_facts} facts, median of {RUNS} runs")
    print("=" * 70)
    for t in ["webclaw", "trafilatura", "firecrawl"]:
        a = result["aggregates"][t]
        print(f"  {t:14s} reduction_mean={a['reduction_mean']:5.1f}%"
              f" fidelity={a['facts_preserved']}/{a['total_facts']} ({a['fidelity_pct']}%)"
              f" latency={a['latency_mean']}s")
    print()
    print(f"  results → {out_path.relative_to(REPO_ROOT)}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
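The fact-matching rule used by `hit_count` — word-boundary matching for single alphabetic tokens, plain substring for anything with a space or non-letters — can be exercised in isolation. A small standalone sketch of that rule:

```python
import re

def fact_hit(text: str, fact: str) -> bool:
    """Mirror of the harness rule: word-boundary match for single
    alphabetic tokens, plain substring for everything else."""
    low, f_low = text.lower(), fact.lower()
    if " " in fact or not fact.isalpha():
        return f_low in low  # multi-word / non-alpha facts match as substrings
    return re.search(r"\b" + re.escape(f_low) + r"\b", low) is not None

assert fact_hit("The API is ready", "API")                 # boundary match
assert not fact_hit("an apiece of land", "API")            # no partial-word hit
assert fact_hit("uptime is 99.999% this year", "99.999")   # substring path
```

The boundary check is what keeps a short fact like `API` from scoring a false positive inside an unrelated word, while numeric facts like `99.999` still match even when glued to a `%` sign.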
31  benchmarks/sites.txt  Normal file
@@ -0,0 +1,31 @@
# One URL per line. Comments (#) and blank lines ignored.
# Sites chosen to span: SPA marketing, enterprise SaaS, documentation,
# long-form content, news, and aggregator pages.

# --- SPA marketing ---
https://openai.com
https://vercel.com
https://anthropic.com
https://www.notion.com
https://stripe.com
https://tavily.com
https://www.shopify.com

# --- Documentation ---
https://docs.python.org/3/
https://react.dev
https://tailwindcss.com/docs/installation
https://nextjs.org/docs
https://github.com

# --- Long-form content ---
https://en.wikipedia.org/wiki/Rust_(programming_language)
https://simonwillison.net/2026/Mar/15/latent-reasoning/
https://paulgraham.com/essays.html

# --- News / commerce ---
https://techcrunch.com

# --- Enterprise SaaS ---
https://www.databricks.com
https://www.hashicorp.com
422  crates/webclaw-cli/src/bench.rs  Normal file
@@ -0,0 +1,422 @@
//! `webclaw bench <url>` — per-URL extraction micro-benchmark.
//!
//! Fetches a page, extracts it via the same pipeline that powers
//! `--format llm`, and reports how many tokens the LLM pipeline
//! removed vs. the raw HTML. Optional `--facts` reuses the
//! benchmark harness's curated fact lists to score fidelity.
//!
//! v1 uses an *approximate* tokenizer (chars/4 for Latin text,
//! chars/2 for CJK-heavy text). Output is clearly labeled
//! "≈ tokens" so nobody mistakes it for a real tiktoken run.
//! Swapping to tiktoken-rs later is a one-function change.

use std::path::{Path, PathBuf};
use std::time::Instant;

use webclaw_core::{extract, to_llm_text};
use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};

/// Inputs collected from the clap subcommand.
pub struct BenchArgs {
    pub url: String,
    pub json: bool,
    pub facts: Option<PathBuf>,
}

/// What a single bench run measures.
struct BenchResult {
    url: String,
    raw_tokens: usize,
    raw_bytes: usize,
    llm_tokens: usize,
    llm_bytes: usize,
    reduction_pct: f64,
    elapsed_secs: f64,
    /// `Some((found, total))` when `--facts` is supplied and the URL has
    /// an entry in the facts file; `None` otherwise.
    facts: Option<(usize, usize)>,
}

pub async fn run(args: &BenchArgs) -> Result<(), String> {
    // Dedicated client so bench doesn't care about global CLI flags
    // (proxies, custom headers, etc.). A reproducible microbench is
    // more useful than an over-configurable one; if someone wants to
    // bench behind a proxy they can set WEBCLAW_PROXY — respected
    // by FetchConfig via the regular channels if we extend later.
    let config = FetchConfig {
        browser: BrowserProfile::Chrome,
        ..FetchConfig::default()
    };
    let client = FetchClient::new(config).map_err(|e| format!("build client: {e}"))?;

    let start = Instant::now();
    let fetched = client
        .fetch(&args.url)
        .await
        .map_err(|e| format!("fetch: {e}"))?;

    let extraction =
        extract(&fetched.html, Some(&fetched.url)).map_err(|e| format!("extract: {e}"))?;
    let llm_text = to_llm_text(&extraction, Some(&fetched.url));
    let elapsed = start.elapsed();

    let raw_tokens = approx_tokens(&fetched.html);
    let llm_tokens = approx_tokens(&llm_text);
    let raw_bytes = fetched.html.len();
    let llm_bytes = llm_text.len();
    let reduction_pct = if raw_tokens == 0 {
        0.0
    } else {
        100.0 * (1.0 - llm_tokens as f64 / raw_tokens as f64)
    };

    let facts = match args.facts.as_deref() {
        Some(path) => check_facts(path, &args.url, &llm_text)?,
        None => None,
    };

    let result = BenchResult {
        url: args.url.clone(),
        raw_tokens,
        raw_bytes,
        llm_tokens,
        llm_bytes,
        reduction_pct,
        elapsed_secs: elapsed.as_secs_f64(),
        facts,
    };

    if args.json {
        print_json(&result);
    } else {
        print_box(&result);
    }
    Ok(())
}

// ---------------------------------------------------------------------------
// Approximate tokenizer
// ---------------------------------------------------------------------------

/// Rough token count. `chars / 4` is the classic English rule of thumb
/// (close to cl100k_base for typical prose). CJK scripts pack ~2 chars
/// per token, so we switch to `chars / 2` when CJK dominates.
///
/// Off by ±10% vs. a real BPE tokenizer, which is fine for "is webclaw's
/// output 66% smaller or 66% bigger than raw HTML" — the signal is
/// order-of-magnitude, not precise accounting.
fn approx_tokens(s: &str) -> usize {
    let total: usize = s.chars().count();
    if total == 0 {
        return 0;
    }
    let cjk = s.chars().filter(|c| is_cjk(*c)).count();
    let cjk_ratio = cjk as f64 / total as f64;
    if cjk_ratio > 0.30 {
        total.div_ceil(2)
    } else {
        total.div_ceil(4)
    }
}

fn is_cjk(c: char) -> bool {
    let n = c as u32;
    (0x4E00..=0x9FFF).contains(&n) // CJK Unified Ideographs
        || (0x3040..=0x309F).contains(&n) // Hiragana
        || (0x30A0..=0x30FF).contains(&n) // Katakana
        || (0xAC00..=0xD7AF).contains(&n) // Hangul Syllables
        || (0x3400..=0x4DBF).contains(&n) // CJK Extension A
}

// ---------------------------------------------------------------------------
// Output: ASCII / Unicode box
// ---------------------------------------------------------------------------

const BOX_WIDTH: usize = 62; // inner width between the two side borders

fn print_box(r: &BenchResult) {
    let host = display_host(&r.url);
    let version = env!("CARGO_PKG_VERSION");

    let top = "─".repeat(BOX_WIDTH);
    let sep = "─".repeat(BOX_WIDTH);

    // Header: host on the left, "webclaw X.Y.Z" on the right.
    let left = host;
    let right = format!("webclaw {version}");
    let pad = BOX_WIDTH.saturating_sub(left.chars().count() + right.chars().count() + 2);
    let header = format!(" {}{}{} ", left, " ".repeat(pad), right);

    println!("┌{top}┐");
    println!("│{header}│");
    println!("├{sep}┤");
    print_row(
        "raw HTML",
        &format!("{} ≈ tokens", fmt_int(r.raw_tokens)),
        &fmt_bytes(r.raw_bytes),
    );
    print_row(
        "--format llm",
        &format!("{} ≈ tokens", fmt_int(r.llm_tokens)),
        &fmt_bytes(r.llm_bytes),
    );
    print_row("token reduction", &format!("{:.1}%", r.reduction_pct), "");
    print_row("extraction time", &format!("{:.2} s", r.elapsed_secs), "");
    if let Some((found, total)) = r.facts {
        let pct = if total == 0 {
            0.0
        } else {
            100.0 * found as f64 / total as f64
        };
        print_row(
            "facts preserved",
            &format!("{found}/{total} ({pct:.1}%)"),
            "",
        );
    }
    println!("└{top}┘");
    println!();
    println!("note: token counts are approximate (chars/4 Latin, chars/2 CJK).");
}

fn print_row(label: &str, middle: &str, right: &str) {
    // Layout inside the box:
    //   " <label padded to 18> <middle> <right right-aligned to fit> "
    let left_col = format!(" {:<18}", label);
    let right_col = format!("{right} ");
    let budget = BOX_WIDTH
        .saturating_sub(left_col.chars().count())
        .saturating_sub(right_col.chars().count());
    let middle_col = format!("{:<width$}", middle, width = budget);
    println!("│{left_col}{middle_col}{right_col}│");
}

fn fmt_int(n: usize) -> String {
    // Comma-group thousands. Avoids pulling in num-format / thousands
    // for one call site.
    let s = n.to_string();
    let bytes = s.as_bytes();
    let mut out = String::with_capacity(s.len() + s.len() / 3);
    for (i, b) in bytes.iter().enumerate() {
        if i > 0 && (bytes.len() - i).is_multiple_of(3) {
            out.push(',');
        }
        out.push(*b as char);
    }
    out
}

fn fmt_bytes(n: usize) -> String {
    const KB: usize = 1024;
    const MB: usize = KB * 1024;
    if n >= MB {
        format!("{:.1} MB", n as f64 / MB as f64)
    } else if n >= KB {
        format!("{} KB", n / KB)
    } else {
        format!("{n} B")
    }
}

/// Best-effort host extraction — if the URL doesn't parse we fall back
/// to the raw string so the box still prints something recognizable.
fn display_host(url: &str) -> String {
    url::Url::parse(url)
        .ok()
        .and_then(|u| u.host_str().map(|h| h.to_string()))
        .unwrap_or_else(|| url.to_string())
}

// ---------------------------------------------------------------------------
// JSON output — single line, stable key order for scripting / CI.
// ---------------------------------------------------------------------------

fn print_json(r: &BenchResult) {
    let mut obj = serde_json::Map::new();
    obj.insert("url".into(), r.url.clone().into());
    obj.insert("raw_tokens".into(), r.raw_tokens.into());
    obj.insert("raw_bytes".into(), r.raw_bytes.into());
    obj.insert("llm_tokens".into(), r.llm_tokens.into());
    obj.insert("llm_bytes".into(), r.llm_bytes.into());
    obj.insert("token_reduction_pct".into(), round1(r.reduction_pct).into());
    obj.insert("elapsed_secs".into(), round2(r.elapsed_secs).into());
    obj.insert("token_method".into(), "approx".into());
    obj.insert("webclaw_version".into(), env!("CARGO_PKG_VERSION").into());
    if let Some((found, total)) = r.facts {
        obj.insert("facts_found".into(), found.into());
        obj.insert("facts_total".into(), total.into());
    }
    // Single-line JSON — easy to append to ndjson for CI runs.
    println!("{}", serde_json::Value::Object(obj));
}

fn round1(f: f64) -> f64 {
    (f * 10.0).round() / 10.0
}
fn round2(f: f64) -> f64 {
    (f * 100.0).round() / 100.0
}

// ---------------------------------------------------------------------------
// Facts file support
// ---------------------------------------------------------------------------

/// Load `facts.json` (same schema as `benchmarks/facts.json`) and check how
/// many curated facts for this URL appear in the extracted LLM text.
/// Returns `None` when the URL has no entry in the file — don't penalize
/// a site that simply hasn't been curated yet.
fn check_facts(path: &Path, url: &str, llm_text: &str) -> Result<Option<(usize, usize)>, String> {
    let raw = std::fs::read_to_string(path)
        .map_err(|e| format!("read facts file {}: {e}", path.display()))?;
    let parsed: serde_json::Value =
        serde_json::from_str(&raw).map_err(|e| format!("parse facts file: {e}"))?;

    let facts_obj = parsed
        .get("facts")
        .and_then(|v| v.as_object())
        .ok_or_else(|| "facts file missing `facts` object".to_string())?;

    let Some(entry) = facts_obj.get(url) else {
        // URL not curated in this facts file — don't print a fidelity
        // column rather than showing a misleading 0/0.
        return Ok(None);
    };
    let Some(list) = entry.as_array() else {
        return Err(format!("facts['{url}'] is not an array"));
    };

    let total = list.len();
    let text_low = llm_text.to_lowercase();
    let mut found = 0usize;
    for f in list {
        let Some(fact) = f.as_str() else { continue };
        if matches_fact(&text_low, fact) {
            found += 1;
        }
    }
    Ok(Some((found, total)))
}

/// Match a single fact against the lowercased text. Mirrors the
/// python harness in `benchmarks/scripts/bench.py`:
/// - Single alphanumeric token → word-boundary (so `API` doesn't hit
///   `apiece`).
/// - Multi-word or non-alpha facts (e.g. `99.999`) → substring.
fn matches_fact(text_low: &str, fact: &str) -> bool {
    let fact_low = fact.to_lowercase();
    if fact_low.is_empty() {
        return false;
    }
    let is_simple_token = fact_low.chars().all(|c| c.is_ascii_alphanumeric())
        && fact_low
            .chars()
            .next()
            .is_some_and(|c| c.is_ascii_alphabetic());

    if !is_simple_token {
        return text_low.contains(&fact_low);
    }
    // Word-boundary scan without pulling in the regex dependency just
    // for this: find each occurrence and check neighbouring chars.
    let bytes = text_low.as_bytes();
    let needle = fact_low.as_bytes();
    let mut i = 0;
    while i + needle.len() <= bytes.len() {
        if &bytes[i..i + needle.len()] == needle {
            let before_ok = i == 0 || !bytes[i - 1].is_ascii_alphanumeric();
            let after_idx = i + needle.len();
            let after_ok = after_idx >= bytes.len() || !bytes[after_idx].is_ascii_alphanumeric();
            if before_ok && after_ok {
                return true;
            }
        }
        i += 1;
    }
    false
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn approx_tokens_empty() {
        assert_eq!(approx_tokens(""), 0);
    }

    #[test]
    fn approx_tokens_latin_roughly_chars_over_4() {
        // 100 ASCII chars → ~25 tokens
        let s = "a".repeat(100);
        assert_eq!(approx_tokens(&s), 25);
    }

    #[test]
    fn approx_tokens_cjk_denser() {
        // 100 CJK chars → ~50 tokens (chars/2 branch)
        let s: String = "中".repeat(100);
        assert_eq!(approx_tokens(&s), 50);
    }

    #[test]
    fn approx_tokens_mixed_uses_latin_branch() {
        // 80 latin + 20 CJK → CJK ratio 20% < 30% → chars/4 branch
        let s = format!("{}{}", "a".repeat(80), "中".repeat(20));
        assert_eq!(approx_tokens(&s), 25);
    }

    #[test]
    fn fmt_int_commas() {
        assert_eq!(fmt_int(0), "0");
        assert_eq!(fmt_int(100), "100");
        assert_eq!(fmt_int(1_000), "1,000");
        assert_eq!(fmt_int(243_465), "243,465");
        assert_eq!(fmt_int(12_345_678), "12,345,678");
    }

    #[test]
    fn fmt_bytes_units() {
        assert_eq!(fmt_bytes(500), "500 B");
        assert_eq!(fmt_bytes(1024), "1 KB");
        assert_eq!(fmt_bytes(1024 * 1024), "1.0 MB");
        assert_eq!(fmt_bytes(1024 * 1024 * 3 + 1024 * 512), "3.5 MB");
    }

    #[test]
    fn matches_fact_word_boundary() {
        assert!(matches_fact("the api is ready", "API"));
        // single-token alphanumeric: API should not hit apiece
        assert!(!matches_fact("an apiece of land", "API"));
    }

    #[test]
    fn matches_fact_multiword_substring() {
        assert!(matches_fact("uptime is 99.999% this year", "99.999"));
        assert!(matches_fact("the app router routes requests", "App Router"));
    }

    #[test]
    fn matches_fact_case_insensitive() {
        assert!(matches_fact("the claude model is opus", "Claude"));
        assert!(matches_fact("the claude model is opus", "opus"));
    }

    #[test]
    fn matches_fact_missing() {
        assert!(!matches_fact("nothing to see here", "vercel"));
    }

    #[test]
    fn display_host_parses_url() {
        assert_eq!(display_host("https://stripe.com/"), "stripe.com");
        assert_eq!(
            display_host("https://docs.python.org/3/"),
            "docs.python.org"
        );
    }

    #[test]
    fn display_host_falls_back_on_garbage() {
        assert_eq!(display_host("not a url"), "not a url");
    }
}
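The chars/4 vs. chars/2 heuristic in `approx_tokens` is easy to transcribe for a quick sanity check outside the Rust test suite. A minimal Python sketch of the same branch logic (covering only the main CJK ranges used above, with ceiling division matching `div_ceil`):

```python
def approx_tokens(s: str) -> int:
    """chars/4 for mostly-Latin text, chars/2 when >30% of chars are CJK;
    both rounded up, mirroring div_ceil in the Rust version."""
    total = len(s)
    if total == 0:
        return 0
    cjk = sum(
        1 for c in s
        if "\u4e00" <= c <= "\u9fff"    # CJK Unified Ideographs
        or "\u3040" <= c <= "\u30ff"    # Hiragana + Katakana
        or "\uac00" <= c <= "\ud7af"    # Hangul Syllables
        or "\u3400" <= c <= "\u4dbf"    # CJK Extension A
    )
    divisor = 2 if cjk / total > 0.30 else 4
    return -(-total // divisor)  # ceiling division

assert approx_tokens("a" * 100) == 25
assert approx_tokens("中" * 100) == 50
assert approx_tokens("a" * 80 + "中" * 20) == 25  # 20% CJK → Latin branch
```

The 30% threshold is what keeps mixed pages (English docs quoting some CJK) on the chars/4 branch; only pages where CJK genuinely dominates flip to the denser estimate.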
@@ -1,80 +0,0 @@
/// Cloud API client for automatic fallback when local extraction fails.
///
/// When WEBCLAW_API_KEY is set (or --api-key is passed), the CLI can fall back
/// to api.webclaw.io for bot-protected or JS-rendered sites. With --cloud flag,
/// all requests go through the cloud API directly.
///
/// NOTE: The canonical, full-featured cloud module lives in webclaw-mcp/src/cloud.rs
/// (smart_fetch, bot detection, JS rendering checks). This is the minimal subset
/// needed by the CLI, kept separate because adding webclaw-mcp as a dependency
/// would pull in rmcp.
use serde_json::{Value, json};

const API_BASE: &str = "https://api.webclaw.io/v1";

pub struct CloudClient {
    api_key: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Create from explicit key or WEBCLAW_API_KEY env var.
    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
        let key = explicit_key
            .map(String::from)
            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
            .filter(|k| !k.is_empty())?;

        Some(Self {
            api_key: key,
            http: reqwest::Client::new(),
        })
    }

    /// Scrape via the cloud API.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, String> {
        let mut body = json!({
            "url": url,
            "formats": formats,
        });
        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }
        self.post("scrape", body).await
    }

    async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
        let resp = self
            .http
            .post(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .timeout(std::time::Duration::from_secs(120))
            .send()
            .await
            .map_err(|e| format!("cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            return Err(format!("cloud API error {status}: {text}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("cloud API response parse failed: {e}"))
    }
}
|
|
@ -1,6 +1,6 @@
|
|||
/// CLI entry point -- wires webclaw-core and webclaw-fetch into a single command.
|
||||
/// All extraction and fetching logic lives in sibling crates; this is pure plumbing.
|
||||
mod cloud;
|
||||
mod bench;
|
||||
|
||||
use std::io::{self, Read as _};
|
||||
use std::path::{Path, PathBuf};
|
||||
|
|
@ -8,7 +8,7 @@ use std::process;
|
|||
use std::sync::Arc;
|
||||
use std::sync::atomic::{AtomicBool, Ordering};
|
||||
|
||||
use clap::{Parser, ValueEnum};
|
||||
use clap::{Parser, Subcommand, ValueEnum};
|
||||
use tracing_subscriber::EnvFilter;
|
||||
use webclaw_core::{
|
||||
ChangeStatus, ContentDiff, ExtractionOptions, ExtractionResult, Metadata, extract_with_options,
|
||||
|
|
@ -86,6 +86,12 @@ fn warn_empty(url: &str, reason: &EmptyReason) {
|
|||
#[derive(Parser)]
|
||||
#[command(name = "webclaw", about = "Extract web content for LLMs", version)]
|
||||
struct Cli {
|
||||
/// Optional subcommand. When omitted, the CLI falls back to the
|
||||
/// traditional flag-based flow (URL + --format, --crawl, etc.).
|
||||
/// Subcommands are used for flows that don't fit that model.
|
||||
#[command(subcommand)]
|
||||
command: Option<Commands>,
|
||||
|
||||
/// URLs to fetch (multiple allowed)
|
||||
#[arg()]
|
||||
urls: Vec<String>,
|
||||
|
|
@@ -283,6 +289,55 @@ struct Cli {
    output_dir: Option<PathBuf>,
}

#[derive(Subcommand)]
enum Commands {
    /// Per-URL extraction micro-benchmark: compares raw HTML vs. the
    /// webclaw --format llm output on token count, bytes, and
    /// extraction time. Uses an approximate tokenizer (see `--help`).
    Bench {
        /// URL to benchmark.
        url: String,

        /// Emit a single JSON line instead of the ASCII table.
        /// Machine-readable shape stable across releases.
        #[arg(long)]
        json: bool,

        /// Optional path to a facts.json (same schema as the repo's
        /// benchmarks/facts.json) for a fidelity column.
        #[arg(long)]
        facts: Option<PathBuf>,
    },

    /// List all vertical extractors in the catalog.
    ///
    /// Each entry has a stable `name` (usable with `webclaw vertical <name>`),
    /// a human-friendly label, a one-line description, and the URL
    /// patterns it claims. The same data is served by `/v1/extractors`
    /// when running the REST API.
    Extractors {
        /// Emit JSON instead of a human-friendly table.
        #[arg(long)]
        json: bool,
    },

    /// Run a vertical extractor by name. Returns typed JSON with fields
    /// specific to the target site (title, price, author, rating, etc.)
    /// rather than generic markdown.
    ///
    /// Use `webclaw extractors` to see the full list. Example:
    /// `webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/`.
    Vertical {
        /// Vertical name (e.g. `reddit`, `github_repo`, `trustpilot_reviews`).
        name: String,
        /// URL to extract.
        url: String,
        /// Emit compact JSON (single line). Default is pretty-printed.
        #[arg(long)]
        raw: bool,
    },
}

#[derive(Clone, ValueEnum)]
enum OutputFormat {
    Markdown,

@@ -296,6 +351,9 @@ enum OutputFormat {
enum Browser {
    Chrome,
    Firefox,
+    /// Safari iOS 26. Pair with a country-matched residential proxy for sites
+    /// that reject non-mobile profiles.
+    SafariIos,
    Random,
}

@@ -322,6 +380,7 @@ impl From<Browser> for BrowserProfile {
        match b {
            Browser::Chrome => BrowserProfile::Chrome,
            Browser::Firefox => BrowserProfile::Firefox,
+            Browser::SafariIos => BrowserProfile::SafariIos,
            Browser::Random => BrowserProfile::Random,
        }
    }

@@ -646,7 +705,7 @@ async fn fetch_and_extract(cli: &Cli) -> Result<FetchOutput, String> {
        let url = normalize_url(raw_url);
        let url = url.as_str();

-        let cloud_client = cloud::CloudClient::new(cli.api_key.as_deref());
+        let cloud_client = webclaw_fetch::cloud::CloudClient::new(cli.api_key.as_deref());

        // --cloud: skip local, go straight to cloud API
        if cli.cloud {

@@ -2244,6 +2303,103 @@ async fn main() {
    let cli = Cli::parse();
    init_logging(cli.verbose);

    // Subcommand path. Handled before the flag dispatch so a subcommand
    // can't collide with a flag-based flow. When no subcommand is set
    // we fall through to the existing behaviour.
    if let Some(ref cmd) = cli.command {
        match cmd {
            Commands::Bench { url, json, facts } => {
                let args = bench::BenchArgs {
                    url: url.clone(),
                    json: *json,
                    facts: facts.clone(),
                };
                if let Err(e) = bench::run(&args).await {
                    eprintln!("error: {e}");
                    process::exit(1);
                }
                return;
            }
            Commands::Extractors { json } => {
                let entries = webclaw_fetch::extractors::list();
                if *json {
                    // Serialize with serde_json. ExtractorInfo derives
                    // Serialize so this is a one-liner.
                    match serde_json::to_string_pretty(&entries) {
                        Ok(s) => println!("{s}"),
                        Err(e) => {
                            eprintln!("error: failed to serialise catalog: {e}");
                            process::exit(1);
                        }
                    }
                } else {
                    // Human-friendly table: NAME + LABEL + one URL
                    // pattern sample. Keeps the output scannable on a
                    // narrow terminal.
                    println!("{} vertical extractors available:\n", entries.len());
                    let name_w = entries.iter().map(|e| e.name.len()).max().unwrap_or(0);
                    let label_w = entries.iter().map(|e| e.label.len()).max().unwrap_or(0);
                    for e in &entries {
                        let pattern_sample = e.url_patterns.first().copied().unwrap_or("");
                        println!(
                            " {:<nw$} {:<lw$} {}",
                            e.name,
                            e.label,
                            pattern_sample,
                            nw = name_w,
                            lw = label_w,
                        );
                    }
                    println!("\nRun one: webclaw vertical <name> <url>");
                }
                return;
            }
            Commands::Vertical { name, url, raw } => {
                // Build a FetchClient with cloud fallback attached when
                // WEBCLAW_API_KEY is set. Antibot-gated verticals
                // (amazon, ebay, etsy, trustpilot) need this to escalate
                // on bot protection.
                let fetch_cfg = webclaw_fetch::FetchConfig {
                    browser: webclaw_fetch::BrowserProfile::Firefox,
                    ..webclaw_fetch::FetchConfig::default()
                };
                let mut client = match webclaw_fetch::FetchClient::new(fetch_cfg) {
                    Ok(c) => c,
                    Err(e) => {
                        eprintln!("error: failed to build fetch client: {e}");
                        process::exit(1);
                    }
                };
                if let Some(cloud) = webclaw_fetch::cloud::CloudClient::from_env() {
                    client = client.with_cloud(cloud);
                }
                match webclaw_fetch::extractors::dispatch_by_name(&client, name, url).await {
                    Ok(data) => {
                        let rendered = if *raw {
                            serde_json::to_string(&data)
                        } else {
                            serde_json::to_string_pretty(&data)
                        };
                        match rendered {
                            Ok(s) => println!("{s}"),
                            Err(e) => {
                                eprintln!("error: JSON encode failed: {e}");
                                process::exit(1);
                            }
                        }
                    }
                    Err(e) => {
                        // UrlMismatch / UnknownVertical / Fetch all get
                        // Display impls with actionable messages.
                        eprintln!("error: {e}");
                        process::exit(1);
                    }
                }
                return;
            }
        }
    }

    // --map: sitemap discovery mode
    if cli.map {
        if let Err(e) = run_map(&cli).await {
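The aligned table in the `Extractors` arm above is built from runtime-computed column widths passed as named format arguments. A standalone sketch (the entry data and `render_rows` helper are made up for illustration):

```rust
// Runtime column widths via named format args, as in the table above.
fn render_rows(entries: &[(&str, &str)]) -> Vec<String> {
    // Widest name decides the column width for every row.
    let name_w = entries.iter().map(|(n, _)| n.len()).max().unwrap_or(0);
    entries
        .iter()
        .map(|(name, label)| format!("{:<name_w$}  {}", name, label, name_w = name_w))
        .collect()
}

fn main() {
    let rows = render_rows(&[("reddit", "Reddit post"), ("github_repo", "GitHub repository")]);
    // Labels start at the same column regardless of name length.
    assert_eq!(rows[0].find("Reddit post"), Some(13));
    assert_eq!(rows[1].find("GitHub repository"), Some(13));
    println!("ok");
}
```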
@@ -920,8 +920,10 @@ fn strip_markdown(md: &str) -> String {
            continue;
        }

-        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs
-        if trimmed.starts_with('|') && trimmed.ends_with('|') {
+        // Convert table data rows: strip leading/trailing pipes, replace inner pipes with tabs.
+        // Require at least 2 chars so the slice `[1..len-1]` stays non-empty on single-pipe rows
+        // (which aren't real tables anyway); a lone `|` previously panicked at `begin <= end`.
+        if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
            let inner = &trimmed[1..trimmed.len() - 1];
            let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
            lines.push(cells.join("\t"));
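The guard in the hunk above can be checked in isolation. A minimal sketch (the function name is illustrative, not the crate's actual API):

```rust
// Illustrative stand-in for the table-row branch of strip_markdown.
fn table_row_to_tsv(trimmed: &str) -> Option<String> {
    // The `len() >= 2` guard keeps the slice below in bounds: a lone "|"
    // starts AND ends with '|', so without it `&trimmed[1..0]` panics.
    if trimmed.len() >= 2 && trimmed.starts_with('|') && trimmed.ends_with('|') {
        let inner = &trimmed[1..trimmed.len() - 1];
        let cells: Vec<&str> = inner.split('|').map(|c| c.trim()).collect();
        return Some(cells.join("\t"));
    }
    None
}

fn main() {
    assert_eq!(table_row_to_tsv("| a | b |").as_deref(), Some("a\tb"));
    // The input that used to panic now falls through harmlessly.
    assert_eq!(table_row_to_tsv("|"), None);
    println!("ok");
}
```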
@@ -12,12 +12,16 @@ serde = { workspace = true }
thiserror = { workspace = true }
tracing = { workspace = true }
tokio = { workspace = true }
async-trait = "0.1"
wreq = { version = "6.0.0-rc.28", features = ["cookies", "gzip", "brotli", "zstd", "deflate"] }
wreq-util = "3.0.0-rc.10"
http = "1"
bytes = "1"
url = "2"
rand = "0.8"
quick-xml = { version = "0.37", features = ["serde"] }
regex = "1"
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
serde_json.workspace = true
calamine = "0.34"
zip = "2"

@@ -7,6 +7,10 @@ pub enum BrowserProfile {
    #[default]
    Chrome,
    Firefox,
+    /// Safari iOS 26 (iPhone). The one profile proven to defeat
+    /// DataDome's immobiliare.it / idealista.it / target.com-class
+    /// rules when paired with a country-scoped residential proxy.
+    SafariIos,
    /// Randomly pick from all available profiles on each request.
    Random,
}

@@ -18,6 +22,7 @@ pub enum BrowserVariant {
    ChromeMacos,
    Firefox,
    Safari,
+    SafariIos26,
    Edge,
}

@@ -177,6 +177,11 @@ enum ClientPool {
pub struct FetchClient {
    pool: ClientPool,
    pdf_mode: PdfMode,
+    /// Optional cloud-fallback client. Extractors that need to
+    /// escalate past bot protection call `client.cloud()` to get this
+    /// out. Stored as `Arc` so cloning a `FetchClient` (common in
+    /// axum state) doesn't clone the underlying reqwest pool.
+    cloud: Option<std::sync::Arc<crate::cloud::CloudClient>>,
}

impl FetchClient {

@@ -225,13 +230,96 @@ impl FetchClient {
            ClientPool::Rotating { clients }
        };

-        Ok(Self { pool, pdf_mode })
+        Ok(Self {
+            pool,
+            pdf_mode,
+            cloud: None,
+        })
    }

    /// Attach a cloud-fallback client. Returns `self` so it composes in
    /// a builder-ish way:
    ///
    /// ```ignore
    /// let client = FetchClient::new(config)?
    ///     .with_cloud(CloudClient::from_env()?);
    /// ```
    ///
    /// Extractors that can escalate past bot protection will call
    /// `client.cloud()` internally. Sets the field regardless of
    /// whether `cloud` is configured to bypass anything specific —
    /// attachment is cheap (just wraps in `Arc`).
    pub fn with_cloud(mut self, cloud: crate::cloud::CloudClient) -> Self {
        self.cloud = Some(std::sync::Arc::new(cloud));
        self
    }

    /// Optional cloud-fallback client, if one was attached via
    /// [`Self::with_cloud`]. Extractors that handle antibot sites
    /// pass this into `cloud::smart_fetch_html`.
    pub fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
        self.cloud.as_deref()
    }

    /// Fetch a URL with per-site rescue paths: Reddit URLs redirect to the
    /// `.json` API, and Akamai-style challenge responses trigger a homepage
    /// cookie warmup and a retry. Returns the same `FetchResult` shape as
    /// [`Self::fetch`] so every caller (CLI, MCP, OSS server, production
    /// server) benefits without shape churn.
    ///
    /// This is the method most callers want. Use plain [`Self::fetch`] only
    /// when you need literal no-rescue behavior (e.g. inside the rescue
    /// logic itself to avoid recursion).
    pub async fn fetch_smart(&self, url: &str) -> Result<FetchResult, FetchError> {
        // Reddit: the HTML page shows a verification interstitial for most
        // client IPs, but appending `.json` returns the post + comment tree
        // publicly. `parse_reddit_json` in downstream code knows how to read
        // the result; here we just do the URL swap at the fetch layer.
        if crate::reddit::is_reddit_url(url) && !url.ends_with(".json") {
            let json_url = crate::reddit::json_url(url);
            // Reddit's public .json API serves JSON to identifiable bot
            // User-Agents and blocks browser UAs with a verification wall.
            // Override our Chrome-profile UA for this specific call.
            let ua = concat!(
                "Webclaw/",
                env!("CARGO_PKG_VERSION"),
                " (+https://webclaw.io)"
            );
            if let Ok(resp) = self
                .fetch_with_headers(&json_url, &[("user-agent", ua)])
                .await
                && resp.status == 200
            {
                let first = resp.html.trim_start().as_bytes().first().copied();
                if matches!(first, Some(b'{') | Some(b'[')) {
                    return Ok(resp);
                }
            }
            // If the .json fetch failed or returned HTML, fall through.
        }

        let resp = self.fetch(url).await?;

        // Akamai / bazadebezolkohpepadr challenge: visit the homepage to
        // collect warmup cookies (_abck, bm_sz, etc.), then retry.
        if is_challenge_html(&resp.html)
            && let Some(homepage) = extract_homepage(url)
        {
            debug!("challenge detected, warming cookies via {homepage}");
            let _ = self.fetch(&homepage).await;
            if let Ok(retry) = self.fetch(url).await {
                return Ok(retry);
            }
        }

        Ok(resp)
    }

    /// Fetch a URL and return the raw HTML + response metadata.
    ///
    /// Automatically retries on transient failures (network errors, 5xx, 429)
-    /// with exponential backoff: 0s, 1s (2 attempts total).
+    /// with exponential backoff: 0s, 1s (2 attempts total). No per-site
+    /// rescue logic; use [`Self::fetch_smart`] for that.
    #[instrument(skip(self), fields(url = %url))]
    pub async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        let delays = [Duration::ZERO, Duration::from_secs(1)];

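The `with_cloud` / `cloud()` pair above is a take-self-by-value builder. A sketch with stand-in types (both structs here are placeholders, not the crate's real `FetchClient`):

```rust
use std::sync::Arc;

struct CloudClient;
struct FetchClient {
    cloud: Option<Arc<CloudClient>>,
}

impl FetchClient {
    fn new() -> Self {
        Self { cloud: None }
    }
    // Consume self, set the field, hand self back: composes as
    // `FetchClient::new().with_cloud(...)` with no mut binding needed.
    fn with_cloud(mut self, cloud: CloudClient) -> Self {
        // Arc so later clones of the client share the fallback instead
        // of duplicating it.
        self.cloud = Some(Arc::new(cloud));
        self
    }
    fn cloud(&self) -> Option<&CloudClient> {
        self.cloud.as_deref()
    }
}

fn main() {
    let client = FetchClient::new().with_cloud(CloudClient);
    assert!(client.cloud().is_some());
    assert!(FetchClient::new().cloud().is_none());
    println!("ok");
}
```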
@@ -279,14 +367,85 @@ impl FetchClient {

    /// Single fetch attempt.
    async fn fetch_once(&self, url: &str) -> Result<FetchResult, FetchError> {
+        self.fetch_once_with_headers(url, &[]).await
+    }
+
+    /// Single fetch attempt with optional per-request headers appended
+    /// after the profile defaults. Used by extractors that need to
+    /// satisfy site-specific headers (e.g. `x-ig-app-id` for Instagram's
+    /// internal API).
+    async fn fetch_once_with_headers(
+        &self,
+        url: &str,
+        extra: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
        let start = Instant::now();
        let client = self.pick_client(url);

-        let resp = client.get(url).send().await?;
+        let mut req = client.get(url);
+        for (k, v) in extra {
+            req = req.header(*k, *v);
+        }
+        let resp = req.send().await?;
        let response = Response::from_wreq(resp).await?;
        response_to_result(response, start)
    }

+    /// Fetch a URL with extra per-request headers appended after the
+    /// browser-profile defaults. Same retry semantics as `fetch`.
+    ///
+    /// Use this when an upstream API requires a header the global
+    /// `FetchConfig.headers` shouldn't carry to other hosts (Instagram's
+    /// `x-ig-app-id`, GitHub's `Authorization` once we wire `GITHUB_TOKEN`,
+    /// Reddit's compliant UA when we add OAuth, etc.).
+    #[instrument(skip(self, extra), fields(url = %url, extra_count = extra.len()))]
+    pub async fn fetch_with_headers(
+        &self,
+        url: &str,
+        extra: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        let delays = [Duration::ZERO, Duration::from_secs(1)];
+        let mut last_err = None;
+
+        for (attempt, delay) in delays.iter().enumerate() {
+            if attempt > 0 {
+                tokio::time::sleep(*delay).await;
+            }
+            match self.fetch_once_with_headers(url, extra).await {
+                Ok(result) => {
+                    if is_retryable_status(result.status) && attempt < delays.len() - 1 {
+                        warn!(
+                            url,
+                            status = result.status,
+                            attempt = attempt + 1,
+                            "retryable status, will retry"
+                        );
+                        last_err = Some(FetchError::Build(format!("HTTP {}", result.status)));
+                        continue;
+                    }
+                    if attempt > 0 {
+                        debug!(url, attempt = attempt + 1, "retry succeeded");
+                    }
+                    return Ok(result);
+                }
+                Err(e) => {
+                    if !is_retryable_error(&e) || attempt == delays.len() - 1 {
+                        return Err(e);
+                    }
+                    warn!(
+                        url,
+                        error = %e,
+                        attempt = attempt + 1,
+                        "transient error, will retry"
+                    );
+                    last_err = Some(e);
+                }
+            }
+        }
+
+        Err(last_err.unwrap_or_else(|| FetchError::Build("all retries exhausted".into())))
+    }
+
    /// Fetch a URL then extract structured content.
    #[instrument(skip(self), fields(url = %url))]
    pub async fn fetch_and_extract(

@@ -495,12 +654,43 @@ impl FetchClient {
        }
    }
}

+// ---------------------------------------------------------------------------
+// Fetcher trait implementation
+//
+// Vertical extractors consume the [`crate::fetcher::Fetcher`] trait
+// rather than `FetchClient` directly, which is what lets the production
+// API server swap in a tls-sidecar-backed implementation without
+// pulling wreq into its dependency graph. For everyone else (CLI, MCP,
+// self-hosted OSS server) this impl means "pass the FetchClient you
+// already have; nothing changes".
+// ---------------------------------------------------------------------------
+
+#[async_trait::async_trait]
+impl crate::fetcher::Fetcher for FetchClient {
+    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
+        FetchClient::fetch(self, url).await
+    }
+
+    async fn fetch_with_headers(
+        &self,
+        url: &str,
+        headers: &[(&str, &str)],
+    ) -> Result<FetchResult, FetchError> {
+        FetchClient::fetch_with_headers(self, url, headers).await
+    }
+
+    fn cloud(&self) -> Option<&crate::cloud::CloudClient> {
+        FetchClient::cloud(self)
+    }
+}
+
/// Collect the browser variants to use based on the browser profile.
fn collect_variants(profile: &BrowserProfile) -> Vec<BrowserVariant> {
    match profile {
        BrowserProfile::Random => browser::all_variants(),
        BrowserProfile::Chrome => vec![browser::latest_chrome()],
        BrowserProfile::Firefox => vec![browser::latest_firefox()],
+        BrowserProfile::SafariIos => vec![BrowserVariant::SafariIos26],
    }
}

@@ -578,22 +768,23 @@ fn is_pdf_content_type(headers: &http::HeaderMap) -> bool {

/// Detect if a response looks like a bot protection challenge page.
fn is_challenge_response(response: &Response) -> bool {
-    let len = response.body().len();
+    is_challenge_html(response.text().as_ref())
+}
+
+/// Same as `is_challenge_response`, operating on a body string directly
+/// so callers holding a `FetchResult` can reuse the heuristic.
+fn is_challenge_html(html: &str) -> bool {
+    let len = html.len();
    if len > 15_000 || len == 0 {
        return false;
    }

-    let text = response.text();
-    let lower = text.to_lowercase();
-
+    let lower = html.to_lowercase();
    if lower.contains("<title>challenge page</title>") {
        return true;
    }

    if lower.contains("bazadebezolkohpepadr") && len < 5_000 {
        return true;
    }

    false
}

853  crates/webclaw-fetch/src/cloud.rs  Normal file

@@ -0,0 +1,853 @@
//! Cloud API fallback client for api.webclaw.io.
//!
//! When local fetch hits bot protection or a JS-only SPA, callers can
//! fall back to the hosted API which runs the full antibot / CDP
//! pipeline. This module is the shared home for that flow: previously
//! duplicated between `webclaw-mcp/src/cloud.rs` and
//! `webclaw-cli/src/cloud.rs`.
//!
//! ## Architecture
//!
//! - [`CloudClient`] — thin reqwest wrapper around the api.webclaw.io
//!   REST surface. Typed errors for the four HTTP failures callers act
//!   on differently (401 / 402 / 429 / other) plus network + parse.
//! - [`is_bot_protected`] / [`needs_js_rendering`] — pure detectors on
//!   response bodies. The detection patterns are public (CF / DataDome
//!   challenge-page signatures) so these live in OSS without leaking
//!   any moat.
//! - [`smart_fetch`] — try-local-then-escalate flow returning an
//!   [`ExtractionResult`] or raw cloud JSON. Kept on the original
//!   `Result<_, String>` signature so the existing MCP / CLI call
//!   sites work unchanged.
//! - [`smart_fetch_html`] — new convenience for the vertical-extractor
//!   pattern: just give me antibot-bypassed HTML so I can run my own
//!   parser on it. Returns the typed [`CloudError`] so extractors can
//!   emit precise "upgrade your plan" / "invalid key" messages.
//!
//! ## Cloud response shape and [`synthesize_html`]
//!
//! `api.webclaw.io/v1/scrape` deliberately does **not** return a
//! `html` field even when `formats=["html"]` is requested. By design
//! the cloud API returns a parsed bundle:
//!
//! ```text
//! {
//!   "url": "https://...",
//!   "metadata": { title, description, image, site_name, ... }, // OG / meta tags
//!   "structured_data": [ { "@type": "...", ... }, ... ], // JSON-LD blocks
//!   "markdown": "# Page Title\n\n...", // cleaned markdown
//!   "antibot": { engine, path, user_agent }, // bypass telemetry
//!   "cache": { status, age_seconds }
//! }
//! ```
//!
//! [`CloudClient::fetch_html`] reassembles that bundle back into a
//! minimal synthetic HTML document so the existing local extractor
//! parsers (JSON-LD walkers, OG regex, DOM-regex) run unchanged over
//! cloud output. Each `structured_data` entry becomes a
//! `<script type="application/ld+json">` tag; each `metadata` field
//! becomes a `<meta property="og:...">` tag; `markdown` lands in a
//! `<pre>` inside the body. Callers that walk Schema.org blocks see
//! exactly what they'd see on a real live page.
//!
//! Amazon-style DOM-regex fallbacks (`#productTitle`, `#landingImage`)
//! won't hit on the synthesised HTML — those IDs only exist on live
//! Amazon pages. Extractors that need DOM regex keep OG meta tag
//! fallbacks for that reason.
//!
//! OSS users without `WEBCLAW_API_KEY` get a clear error pointing at
//! signup when a site is blocked; nothing fails silently. Cloud users
//! get the escalation for free.

use std::time::Duration;

use http::HeaderMap;
use serde_json::{Value, json};
use thiserror::Error;
use tracing::{debug, info, warn};

// Client type isn't needed here anymore now that smart_fetch* takes
// `&dyn Fetcher`. Kept as a comment for historical context: this
// module used to import FetchClient directly before v0.5.1.
||||

// ---------------------------------------------------------------------------
// URLs + defaults — keep in one place so "change the signup link" is a
// single-commit edit.
// ---------------------------------------------------------------------------

const API_BASE_DEFAULT: &str = "https://api.webclaw.io/v1";
const DEFAULT_TIMEOUT_SECS: u64 = 120;

const SIGNUP_URL: &str = "https://webclaw.io/signup";
const PRICING_URL: &str = "https://webclaw.io/pricing";
const KEYS_URL: &str = "https://webclaw.io/dashboard/api-keys";

// ---------------------------------------------------------------------------
// Errors
// ---------------------------------------------------------------------------

/// Structured cloud-fallback error. Variants correspond to the HTTP
/// outcomes callers act on differently — a 401 needs a different UX
/// than a 402 which needs a different UX than a network blip.
///
/// Display messages end with an actionable URL so API consumers can
/// surface them to users verbatim.
#[derive(Debug, Error)]
pub enum CloudError {
    /// No `WEBCLAW_API_KEY` configured. Returned by [`smart_fetch_html`]
    /// and friends when they hit bot protection but have no client to
    /// escalate to.
    #[error(
        "this site is behind antibot protection. \
         Set WEBCLAW_API_KEY to unlock automatic cloud bypass. \
         Free tier: {SIGNUP_URL}"
    )]
    NotConfigured,

    /// HTTP 401 — the key is present but rejected.
    #[error(
        "WEBCLAW_API_KEY rejected (HTTP 401). \
         Check or regenerate your key at {KEYS_URL}"
    )]
    Unauthorized,

    /// HTTP 402 — the key is valid but the plan doesn't cover the call.
    #[error(
        "your plan doesn't include this endpoint / site (HTTP 402). \
         Upgrade at {PRICING_URL}"
    )]
    InsufficientPlan,

    /// HTTP 429 — rate limit.
    #[error(
        "cloud API rate limit reached (HTTP 429). \
         Wait a moment or upgrade at {PRICING_URL}"
    )]
    RateLimited,

    /// HTTP 4xx / 5xx the caller probably can't do anything specific
    /// about. Body is truncated to a sensible length for logs.
    #[error("cloud API returned HTTP {status}: {body}")]
    ServerError { status: u16, body: String },

    #[error("cloud request failed: {0}")]
    Network(String),

    #[error("cloud response parse failed: {0}")]
    ParseFailed(String),
}

impl CloudError {
    /// Build from a non-success HTTP response, routing well-known
    /// statuses to dedicated variants.
    fn from_status_and_body(status: u16, body: String) -> Self {
        match status {
            401 => Self::Unauthorized,
            402 => Self::InsufficientPlan,
            429 => Self::RateLimited,
            _ => Self::ServerError {
                status,
                body: truncate(&body, 500).to_string(),
            },
        }
    }
}
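The status routing in `from_status_and_body` above reduces to a plain match on the code. A runnable restatement using a dependency-free enum (the `thiserror` derive and body truncation are omitted; `CloudErrorKind` and `classify` are illustrative names):

```rust
// Dependency-free sketch of the 401/402/429/other routing above.
#[derive(Debug, PartialEq)]
enum CloudErrorKind {
    Unauthorized,     // 401: key present but rejected
    InsufficientPlan, // 402: key valid, plan too small
    RateLimited,      // 429: back off or upgrade
    Server(u16),      // anything else the caller can't act on
}

fn classify(status: u16) -> CloudErrorKind {
    match status {
        401 => CloudErrorKind::Unauthorized,
        402 => CloudErrorKind::InsufficientPlan,
        429 => CloudErrorKind::RateLimited,
        other => CloudErrorKind::Server(other),
    }
}

fn main() {
    assert_eq!(classify(401), CloudErrorKind::Unauthorized);
    assert_eq!(classify(402), CloudErrorKind::InsufficientPlan);
    assert_eq!(classify(429), CloudErrorKind::RateLimited);
    assert_eq!(classify(500), CloudErrorKind::Server(500));
    println!("ok");
}
```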

impl From<reqwest::Error> for CloudError {
    fn from(e: reqwest::Error) -> Self {
        Self::Network(e.to_string())
    }
}

/// Backwards-compatibility bridge: a lot of pre-existing MCP / CLI call
/// sites `use .await?` into functions returning `Result<_, String>`.
/// Having this `From` impl means those sites keep compiling while we
/// migrate them to the typed error over time.
impl From<CloudError> for String {
    fn from(e: CloudError) -> Self {
        e.to_string()
    }
}

fn truncate(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}
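The `truncate` helper above counts characters rather than bytes: `char_indices().nth(max)` yields the byte offset where the (max+1)-th char begins, so the slice can never split a multi-byte UTF-8 sequence. A runnable check of the same function:

```rust
// Char-boundary-safe truncation, identical in shape to the helper above.
fn truncate(text: &str, max: usize) -> &str {
    match text.char_indices().nth(max) {
        // byte_pos is the offset where char number `max` starts, so the
        // slice keeps exactly `max` chars and stays on a char boundary.
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text, // shorter than the limit: return unchanged
    }
}

fn main() {
    assert_eq!(truncate("hello", 3), "hel");
    // 'é' is two bytes; a naive byte slice at index 2 would panic here.
    assert_eq!(truncate("héllo", 2), "hé");
    assert_eq!(truncate("hi", 10), "hi");
    println!("ok");
}
```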

// ---------------------------------------------------------------------------
// CloudClient
// ---------------------------------------------------------------------------

/// Thin reqwest client around api.webclaw.io. Cloneable cheaply — the
/// inner `reqwest::Client` already refcounts its connection pool.
#[derive(Clone)]
pub struct CloudClient {
    api_key: String,
    base_url: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Build from an explicit key (e.g. a `--api-key` CLI flag) or fall
    /// back to the `WEBCLAW_API_KEY` env var. Returns `None` when
    /// neither is set / both are empty.
    ///
    /// This is the function call sites should use by default — it's
    /// what both the CLI and MCP want.
    pub fn new(explicit_key: Option<&str>) -> Option<Self> {
        explicit_key
            .map(String::from)
            .or_else(|| std::env::var("WEBCLAW_API_KEY").ok())
            .filter(|k| !k.trim().is_empty())
            .map(Self::with_key)
    }

    /// Build from `WEBCLAW_API_KEY` env only. Thin wrapper kept for
    /// readability at call sites that never accept a flag.
    pub fn from_env() -> Option<Self> {
        Self::new(None)
    }

    /// Build with an explicit key. Useful when the caller already has
    /// a key from somewhere other than env or a flag (e.g. loaded from
    /// config).
    pub fn with_key(api_key: impl Into<String>) -> Self {
        Self::with_key_and_base(api_key, API_BASE_DEFAULT)
    }

    /// Build with an explicit key and base URL. Used by integration
    /// tests and staging deployments.
    pub fn with_key_and_base(api_key: impl Into<String>, base_url: impl Into<String>) -> Self {
        let http = reqwest::Client::builder()
            .timeout(Duration::from_secs(DEFAULT_TIMEOUT_SECS))
            .build()
            .expect("reqwest client builder failed with default settings");
        Self {
            api_key: api_key.into(),
            base_url: base_url.into().trim_end_matches('/').to_string(),
            http,
        }
    }

    pub fn base_url(&self) -> &str {
        &self.base_url
    }

    /// Generic POST. Endpoint may be `"scrape"` or `"/scrape"` — we
    /// normalise the slash.
    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, CloudError> {
        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
        let resp = self
            .http
            .post(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .send()
            .await?;
        parse_cloud_response(resp).await
    }

    /// Generic GET.
    pub async fn get(&self, endpoint: &str) -> Result<Value, CloudError> {
        let url = format!("{}/{}", self.base_url, endpoint.trim_start_matches('/'));
        let resp = self
            .http
            .get(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await?;
        parse_cloud_response(resp).await
    }
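The slash normalisation in `post`/`get` above splits the work across two points: the constructor trims the base URL's trailing slash once, and each call trims the endpoint's leading slash, so `"scrape"` and `"/scrape"` hit the same URL. A standalone sketch of that joining rule (`join_endpoint` is an illustrative name):

```rust
// Both halves of the normalisation in one place for demonstration.
fn join_endpoint(base_url: &str, endpoint: &str) -> String {
    format!(
        "{}/{}",
        base_url.trim_end_matches('/'),
        endpoint.trim_start_matches('/')
    )
}

fn main() {
    let a = join_endpoint("https://api.webclaw.io/v1", "scrape");
    let b = join_endpoint("https://api.webclaw.io/v1/", "/scrape");
    // Slash or no slash, the resulting URL is identical.
    assert_eq!(a, "https://api.webclaw.io/v1/scrape");
    assert_eq!(a, b);
    println!("ok");
}
```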

    /// `POST /v1/scrape` with the caller's extraction options. This is
    /// the public "do everything" surface: the cloud side handles
    /// fetch + antibot + JS render + extraction + formatting.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, CloudError> {
        let mut body = json!({ "url": url, "formats": formats });
        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }
        self.post("scrape", body).await
    }

    /// Get antibot-bypassed page data back as a synthetic HTML string.
    ///
    /// `api.webclaw.io/v1/scrape` intentionally does not return raw
    /// HTML: it returns pre-parsed `structured_data` (JSON-LD blocks)
    /// plus `metadata` (title, description, OG tags, image) plus a
    /// `markdown` body. We reassemble those into a minimal HTML doc
    /// that looks enough like the real page for our local extractor
    /// parsers to run unchanged: each JSON-LD block gets emitted as a
    /// `<script type="application/ld+json">` tag, metadata gets
    /// emitted as OG `<meta>` tags, and the markdown lands in the
    /// body. Extractors that walk JSON-LD (ecommerce_product,
    /// trustpilot_reviews, ebay_listing, etsy_listing, amazon_product)
    /// see exactly the same shapes they'd see from a live HTML fetch.
    pub async fn fetch_html(&self, url: &str) -> Result<String, CloudError> {
        let resp = self.scrape(url, &["markdown"], &[], &[], false).await?;
        Ok(synthesize_html(&resp))
    }
}

/// Reassemble a minimal HTML document from a cloud `/v1/scrape`
/// response so existing HTML-based extractor parsers can run against
/// cloud output without a separate code path.
fn synthesize_html(resp: &Value) -> String {
    let mut out = String::with_capacity(8_192);
    out.push_str("<html><head>\n");

    // Metadata → OG meta tags. Keep keys stable with what local
    // extractors read: og:title, og:description, og:image, og:site_name.
    if let Some(meta) = resp.get("metadata").and_then(|m| m.as_object()) {
        for (src_key, og_key) in [
            ("title", "title"),
            ("description", "description"),
            ("image", "image"),
            ("site_name", "site_name"),
        ] {
            if let Some(val) = meta.get(src_key).and_then(|v| v.as_str())
                && !val.is_empty()
            {
                out.push_str(&format!(
                    "<meta property=\"og:{og_key}\" content=\"{}\">\n",
                    html_escape_attr(val)
                ));
            }
        }
    }

    // Structured data blocks → <script type="application/ld+json">.
    // Serialise losslessly so extract_json_ld's parser gets the same
    // shape it would get from a real page.
    if let Some(blocks) = resp.get("structured_data").and_then(|v| v.as_array()) {
        for block in blocks {
            if let Ok(s) = serde_json::to_string(block) {
                out.push_str("<script type=\"application/ld+json\">");
                out.push_str(&s);
                out.push_str("</script>\n");
            }
        }
    }

    out.push_str("</head><body>\n");

    // Markdown body → plaintext in <body>. Extractors that regex over
    // <div> IDs won't hit here, but they won't hit on local cloud
    // bypass either. OK to keep minimal.
    if let Some(md) = resp.get("markdown").and_then(|v| v.as_str()) {
        out.push_str("<pre>");
        out.push_str(&html_escape_text(md));
        out.push_str("</pre>\n");
    }

    out.push_str("</body></html>");
    out
}
|
||||
fn html_escape_attr(s: &str) -> String {
|
||||
s.replace('&', "&")
|
||||
.replace('"', """)
|
||||
.replace('<', "<")
|
||||
.replace('>', ">")
|
||||
}
|
||||
|
||||
fn html_escape_text(s: &str) -> String {
|
||||
s.replace('&', "&")
|
||||
.replace('<', "<")
|
||||
.replace('>', ">")
|
||||
}
|
||||
|
||||
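A quick standalone sketch of the escape/unescape contract these helpers rely on (pure std; the function names here are illustrative, not the crate's): the escaper must replace `&` first, and any matching unescaper must replace `&amp;` last, otherwise a literal `&quot;` in the source text gets double-decoded.

```rust
// Sketch of the attribute-escape round trip used when synthesizing
// HTML and reading it back out of OG meta tags. Illustrative names.
fn escape_attr(s: &str) -> String {
    // `&` must go first so we don't re-escape the entities we emit.
    s.replace('&', "&amp;")
        .replace('"', "&quot;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

fn unescape_attr(s: &str) -> String {
    // `&amp;` must go last so "&amp;lt;" decodes to "&lt;", not "<".
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

fn main() {
    let original = r#"She said "hi" & left <fast>"#;
    let escaped = escape_attr(original);
    assert_eq!(escaped, "She said &quot;hi&quot; &amp; left &lt;fast&gt;");
    assert_eq!(unescape_attr(&escaped), original);
    println!("round-trip ok");
}
```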
async fn parse_cloud_response(resp: reqwest::Response) -> Result<Value, CloudError> {
    let status = resp.status();
    if status.is_success() {
        return resp
            .json()
            .await
            .map_err(|e| CloudError::ParseFailed(e.to_string()));
    }
    let body = resp.text().await.unwrap_or_default();
    Err(CloudError::from_status_and_body(status.as_u16(), body))
}

// ---------------------------------------------------------------------------
// Detection
// ---------------------------------------------------------------------------

/// True when a fetched response body is actually a bot-protection
/// challenge page rather than the content the caller asked for.
///
/// Conservative — only fires on patterns that indicate the *entire*
/// page is a challenge, not embedded CAPTCHAs on a real content page.
pub fn is_bot_protected(html: &str, headers: &HeaderMap) -> bool {
    let html_lower = html.to_lowercase();

    // Cloudflare challenge page.
    if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
        return true;
    }

    // Cloudflare "Just a moment" / "Checking your browser" interstitial.
    if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
        && html_lower.contains("cf-spinner")
    {
        return true;
    }

    // Cloudflare Turnstile. Only counts when the page is small —
    // legitimate pages embed Turnstile for signup forms etc.
    if (html_lower.contains("cf-turnstile")
        || html_lower.contains("challenges.cloudflare.com/turnstile"))
        && html.len() < 100_000
    {
        return true;
    }

    // DataDome.
    if html_lower.contains("geo.captcha-delivery.com")
        || html_lower.contains("captcha-delivery.com/captcha")
    {
        return true;
    }

    // AWS WAF.
    if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
        return true;
    }

    // AWS WAF "Verifying your connection" interstitial (used by Trustpilot).
    // Distinct from the captcha-branded path above: the challenge page is
    // a tiny HTML shell with an `interstitial-spinner` div and no content.
    // Gating on html.len() keeps false-positives off long pages that
    // happen to mention the phrase in an unrelated context.
    if html_lower.contains("interstitial-spinner")
        && html_lower.contains("verifying your connection")
        && html.len() < 10_000
    {
        return true;
    }

    // hCaptcha *blocking* page (not just an embedded widget).
    if html_lower.contains("hcaptcha.com")
        && html_lower.contains("h-captcha")
        && html.len() < 50_000
    {
        return true;
    }

    // Cloudflare via response headers + challenge body.
    let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
    if has_cf_headers
        && (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
    {
        return true;
    }

    false
}

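The Turnstile branch's size gate is the part that earns the "conservative" claim: the marker alone is not enough, the page must also be small. A minimal self-contained sketch of just that heuristic (the function name is illustrative, the thresholds match the detector above):

```rust
// Size-gated Turnstile heuristic: a challenge page is a tiny shell,
// while a real page that merely embeds Turnstile (signup forms etc.)
// carries lots of other markup and clears the 100 KB gate.
fn looks_like_turnstile_challenge(html: &str) -> bool {
    let lower = html.to_lowercase();
    (lower.contains("cf-turnstile")
        || lower.contains("challenges.cloudflare.com/turnstile"))
        && html.len() < 100_000
}

fn main() {
    // A bare widget on an otherwise empty page: flagged.
    assert!(looks_like_turnstile_challenge(r#"<div class="cf-turnstile"></div>"#));

    // The same widget buried in ~160 KB of real content: not flagged.
    let real_page = format!(
        "{}<div class=\"cf-turnstile\"></div>",
        "content ".repeat(20_000)
    );
    assert!(!looks_like_turnstile_challenge(&real_page));
    println!("ok");
}
```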
/// True when a page likely needs JS rendering — a large HTML document
/// with almost no extractable text + an SPA framework signature.
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
    let has_scripts = html.contains("<script");

    // Tier 1: almost no extractable text from a large-ish page.
    if word_count < 50 && html.len() > 5_000 && has_scripts {
        return true;
    }

    // Tier 2: SPA framework markers + low content-to-HTML ratio.
    if word_count < 800 && html.len() > 50_000 && has_scripts {
        let html_lower = html.to_lowercase();
        let has_spa_marker = html_lower.contains("react-app")
            || html_lower.contains("id=\"__next\"")
            || html_lower.contains("id=\"root\"")
            || html_lower.contains("id=\"app\"")
            || html_lower.contains("__next_data__")
            || html_lower.contains("nuxt")
            || html_lower.contains("ng-app");
        if has_spa_marker {
            return true;
        }
    }

    false
}

// ---------------------------------------------------------------------------
// Smart-fetch: classic flow for MCP / CLI (returns either an extraction
// or raw cloud JSON)
// ---------------------------------------------------------------------------

/// Result of [`smart_fetch`]: either a local extraction or the raw
/// cloud API response when we escalated.
pub enum SmartFetchResult {
    Local(Box<webclaw_core::ExtractionResult>),
    Cloud(Value),
}

/// Try local fetch + extract first. On bot protection or detected
/// JS-render, fall back to `cloud.scrape(...)` with the caller's
/// formats. Returns `Err(String)` so existing call sites that expect
/// stringified errors keep compiling.
///
/// Prefer [`smart_fetch_html`] for new callers — it surfaces the typed
/// [`CloudError`] so you can render precise UX.
pub async fn smart_fetch(
    client: &dyn crate::fetcher::Fetcher,
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
        .await
        .map_err(|_| format!("Fetch timed out after 30s for {url}"))?
        .map_err(|e| format!("Fetch failed: {e}"))?;

    if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
        info!(url, "bot protection detected, falling back to cloud API");
        return cloud_scrape_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    let options = webclaw_core::ExtractionOptions {
        include_selectors: include_selectors.to_vec(),
        exclude_selectors: exclude_selectors.to_vec(),
        only_main_content,
        include_raw_html: false,
    };
    let extraction =
        webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
            .map_err(|e| format!("Extraction failed: {e}"))?;

    if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
        info!(
            url,
            word_count = extraction.metadata.word_count,
            html_len = fetch_result.html.len(),
            "JS-rendered page detected, falling back to cloud API"
        );
        return cloud_scrape_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    Ok(SmartFetchResult::Local(Box::new(extraction)))
}

async fn cloud_scrape_fallback(
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    let Some(c) = cloud else {
        return Err(CloudError::NotConfigured.to_string());
    };
    let resp = c
        .scrape(
            url,
            formats,
            include_selectors,
            exclude_selectors,
            only_main_content,
        )
        .await
        .map_err(|e| e.to_string())?;
    info!(url, "cloud API fallback successful");
    Ok(SmartFetchResult::Cloud(resp))
}

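Stripped of the fetch and extraction plumbing, the escalation policy in `smart_fetch` reduces to a two-input decision table. A sketch with everything else stubbed out (names here are illustrative, not part of the crate's API):

```rust
// Escalation decision table: either detector firing routes the
// request to the cloud; otherwise the local extraction is returned.
#[derive(Debug, PartialEq)]
enum Route {
    Local,
    Cloud,
}

fn route(bot_protected: bool, needs_js: bool) -> Route {
    if bot_protected || needs_js {
        Route::Cloud
    } else {
        Route::Local
    }
}

fn main() {
    assert_eq!(route(false, false), Route::Local);
    assert_eq!(route(true, false), Route::Cloud);
    assert_eq!(route(false, true), Route::Cloud);
    assert_eq!(route(true, true), Route::Cloud);
    println!("ok");
}
```

Note that bot protection is checked before extraction even runs, while the JS-render check needs the extraction's word count, so the two branches sit at different points in the real function.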
// ---------------------------------------------------------------------------
// Smart-fetch-HTML: for vertical extractors
// ---------------------------------------------------------------------------

/// Where the HTML ultimately came from — useful for callers that want
/// to track "did we fall back?" for logging or pricing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchSource {
    Local,
    Cloud,
}

/// Antibot-aware HTML fetch result. The `html` field is always populated.
pub struct FetchedHtml {
    pub html: String,
    pub final_url: String,
    pub source: FetchSource,
}

/// Try local fetch; on bot protection, escalate to the cloud's
/// `/v1/scrape` with `formats=["html"]` and return the raw HTML.
///
/// Designed for the vertical-extractor pattern where the caller has
/// its own parser and just needs bytes.
pub async fn smart_fetch_html(
    client: &dyn crate::fetcher::Fetcher,
    cloud: Option<&CloudClient>,
    url: &str,
) -> Result<FetchedHtml, CloudError> {
    let resp = client
        .fetch(url)
        .await
        .map_err(|e| CloudError::Network(e.to_string()))?;

    if !is_bot_protected(&resp.html, &resp.headers) {
        return Ok(FetchedHtml {
            html: resp.html,
            final_url: resp.url,
            source: FetchSource::Local,
        });
    }

    let Some(c) = cloud else {
        warn!(url, "bot protection detected + no cloud client configured");
        return Err(CloudError::NotConfigured);
    };
    debug!(url, "bot protection detected, escalating to cloud");
    let html = c.fetch_html(url).await?;
    Ok(FetchedHtml {
        html,
        final_url: url.to_string(),
        source: FetchSource::Cloud,
    })
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;

    fn empty_headers() -> HeaderMap {
        HeaderMap::new()
    }

    // --- detectors ----------------------------------------------------------

    #[test]
    fn is_bot_protected_detects_cloudflare_challenge() {
        let html = "<html><body>_cf_chl_opt loaded</body></html>";
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_detects_turnstile_on_short_page() {
        let html = "<div class=\"cf-turnstile\"></div>";
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_ignores_turnstile_on_real_content() {
        let html = format!(
            "<html><body>{}<div class=\"cf-turnstile\"></div></body></html>",
            "lots of real content ".repeat(8_000)
        );
        assert!(!is_bot_protected(&html, &empty_headers()));
    }

    #[test]
    fn is_bot_protected_detects_aws_waf_verifying_connection() {
        // The exact shape Trustpilot serves under AWS WAF.
        let html = r#"<div class="container"><div id="loading-state">
<div class="interstitial-spinner" id="spinner"></div>
<h1>Verifying your connection...</h1></div></div>"#;
        assert!(is_bot_protected(html, &empty_headers()));
    }

    #[test]
    fn synthesize_html_embeds_jsonld_and_og_tags() {
        let resp = json!({
            "url": "https://example.com/p/1",
            "metadata": {
                "title": "My Product",
                "description": "A nice thing.",
                "image": "https://cdn.example.com/1.jpg",
                "site_name": "Example Shop"
            },
            "structured_data": [
                {"@context":"https://schema.org","@type":"Product",
                 "name":"Widget","offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
            ],
            "markdown": "# Widget\n\nA nice widget."
        });
        let html = synthesize_html(&resp);
        // OG tags from metadata.
        assert!(html.contains(r#"<meta property="og:title" content="My Product">"#));
        assert!(
            html.contains(r#"<meta property="og:image" content="https://cdn.example.com/1.jpg">"#)
        );
        // JSON-LD block preserved losslessly.
        assert!(html.contains(r#"<script type="application/ld+json">"#));
        assert!(html.contains(r#""@type":"Product""#));
        assert!(html.contains(r#""price":"9.99""#));
        // Body carries markdown.
        assert!(html.contains("A nice widget."));
    }

    #[test]
    fn synthesize_html_handles_missing_fields_gracefully() {
        let resp = json!({"url": "https://example.com", "metadata": {}});
        let html = synthesize_html(&resp);
        // No panic, no stray unclosed tags.
        assert!(html.starts_with("<html><head>"));
        assert!(html.ends_with("</body></html>"));
    }

    #[test]
    fn synthesize_html_escapes_attribute_quotes() {
        let resp = json!({
            "metadata": {"title": r#"She said "hi""#}
        });
        let html = synthesize_html(&resp);
        assert!(html.contains(r#"og:title" content="She said &quot;hi&quot;""#));
    }

    #[test]
    fn is_bot_protected_ignores_phrase_on_real_content() {
        // A real article that happens to mention the phrase in prose
        // should not trigger the short-page detector.
        let html = format!(
            "<html><body>{}<p>Verifying your connection is tricky.</p></body></html>",
            "article text ".repeat(2_000)
        );
        assert!(!is_bot_protected(&html, &empty_headers()));
    }

    #[test]
    fn needs_js_rendering_flags_spa_skeleton() {
        let html = format!(
            "<html><body><div id=\"__next\"></div>{}</body></html>",
            "<script>x</script>".repeat(500)
        );
        assert!(needs_js_rendering(10, &html));
    }

    #[test]
    fn needs_js_rendering_passes_real_article() {
        let html = format!(
            "<html><body>{}<script>x</script></body></html>",
            "Real article text ".repeat(5_000)
        );
        assert!(!needs_js_rendering(5_000, &html));
    }

    // --- CloudError mapping -------------------------------------------------

    #[test]
    fn cloud_error_maps_401() {
        let e = CloudError::from_status_and_body(401, "invalid key".into());
        assert!(matches!(e, CloudError::Unauthorized));
        assert!(e.to_string().contains(KEYS_URL));
    }

    #[test]
    fn cloud_error_maps_402() {
        let e = CloudError::from_status_and_body(402, "{}".into());
        assert!(matches!(e, CloudError::InsufficientPlan));
        assert!(e.to_string().contains(PRICING_URL));
    }

    #[test]
    fn cloud_error_maps_429() {
        let e = CloudError::from_status_and_body(429, "slow down".into());
        assert!(matches!(e, CloudError::RateLimited));
        assert!(e.to_string().contains(PRICING_URL));
    }

    #[test]
    fn cloud_error_maps_generic_5xx() {
        let e = CloudError::from_status_and_body(503, "x".repeat(2000));
        match e {
            CloudError::ServerError { status, body } => {
                assert_eq!(status, 503);
                assert!(body.len() <= 500);
            }
            _ => panic!("expected ServerError"),
        }
    }

    #[test]
    fn not_configured_error_points_at_signup() {
        let msg = CloudError::NotConfigured.to_string();
        assert!(msg.contains(SIGNUP_URL));
        assert!(msg.contains("WEBCLAW_API_KEY"));
    }

    // --- CloudClient construction ------------------------------------------

    #[test]
    fn cloud_client_explicit_key_wins_over_env() {
        // SAFETY: this test mutates process env. Serial tests only.
        // Set env to something, pass an explicit key, explicit should win.
        // (We don't actually *call* the API, just check the struct stored
        // the right key.)
        // rustc std::env::set_var is unsafe in newer toolchains.
        unsafe {
            std::env::set_var("WEBCLAW_API_KEY", "from-env");
        }
        let client = CloudClient::new(Some("from-flag")).expect("client built");
        assert_eq!(client.api_key, "from-flag");
        unsafe {
            std::env::remove_var("WEBCLAW_API_KEY");
        }
    }

    #[test]
    fn cloud_client_none_when_empty() {
        unsafe {
            std::env::remove_var("WEBCLAW_API_KEY");
        }
        assert!(CloudClient::new(None).is_none());
        assert!(CloudClient::new(Some("")).is_none());
        assert!(CloudClient::new(Some(" ")).is_none());
    }

    #[test]
    fn cloud_client_base_url_strips_trailing_slash() {
        let c = CloudClient::with_key_and_base("k", "https://api.example.com/v1/");
        assert_eq!(c.base_url(), "https://api.example.com/v1");
    }

    #[test]
    fn truncate_respects_char_boundaries() {
        // Ensure we don't slice inside a multi-byte char.
        let s = "a".repeat(10) + "é"; // é is 2 bytes
        let out = truncate(&s, 11);
        assert_eq!(out.chars().count(), 11);
    }
}
452	crates/webclaw-fetch/src/extractors/amazon_product.rs	(new file)
@@ -0,0 +1,452 @@
//! Amazon product detail page extractor.
//!
//! Amazon product pages (`/dp/{ASIN}/` on every locale) are
//! inconsistently protected. Sometimes our local TLS fingerprint gets
//! a real HTML page; sometimes we land on a CAPTCHA interstitial;
//! sometimes we land on a real page that for whatever reason ships
//! no Product JSON-LD (Amazon A/B-tests this regularly). So the
//! extractor has a two-stage fallback:
//!
//! 1. Try local fetch + parse. If we got Product JSON-LD back, great:
//!    we have everything (title, brand, price, availability, rating).
//! 2. If local fetch worked *but the page has no Product JSON-LD* AND
//!    a cloud client is configured, force-escalate to api.webclaw.io.
//!    Cloud's render + antibot pipeline reliably surfaces the
//!    structured data. Without a cloud client we return whatever we
//!    got from local (usually just title via `#productTitle` or OG
//!    meta tags).
//!
//! Parsing tries JSON-LD first, DOM regex (`#productTitle`,
//! `#landingImage`) second, OG `<meta>` tags third. The OG path
//! matters because the cloud's synthesized HTML ships metadata as
//! OG tags but lacks Amazon's DOM IDs.
//!
//! Auto-dispatch: we accept any amazon.* host with a `/dp/{ASIN}/`
//! path. ASINs are a stable Amazon identifier so we extract that as
//! part of the response even when everything else is empty (tells
//! callers the URL was at least recognised).

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "amazon_product",
    label: "Amazon product",
    description: "Returns product detail: title, brand, price, currency, availability, rating, image, ASIN. Works best with WEBCLAW_API_KEY — Amazon's antibot means extraction usually escalates through the cloud.",
    url_patterns: &[
        "https://www.amazon.com/dp/{ASIN}",
        "https://www.amazon.co.uk/dp/{ASIN}",
        "https://www.amazon.de/dp/{ASIN}",
        "https://www.amazon.fr/dp/{ASIN}",
        "https://www.amazon.it/dp/{ASIN}",
        "https://www.amazon.es/dp/{ASIN}",
        "https://www.amazon.co.jp/dp/{ASIN}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_amazon_host(host) {
        return false;
    }
    parse_asin(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let asin = parse_asin(url)
        .ok_or_else(|| FetchError::Build(format!("amazon_product: no ASIN in '{url}'")))?;

    let mut fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    // Amazon ships Product JSON-LD inconsistently even on non-CAPTCHA
    // pages (they A/B-test it). When local fetch succeeded but has no
    // Product JSON-LD, force-escalate to the cloud which runs the
    // render pipeline and reliably surfaces structured data. No-op
    // when cloud isn't configured — we return whatever local gave us.
    if fetched.source == cloud::FetchSource::Local
        && find_product_jsonld(&fetched.html).is_none()
        && let Some(c) = client.cloud()
    {
        match c.fetch_html(url).await {
            Ok(cloud_html) => {
                fetched = cloud::FetchedHtml {
                    html: cloud_html,
                    final_url: url.to_string(),
                    source: cloud::FetchSource::Cloud,
                };
            }
            Err(e) => {
                tracing::debug!(
                    error = %e,
                    "amazon_product: cloud escalation failed, keeping local"
                );
            }
        }
    }

    let mut data = parse(&fetched.html, url, &asin);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

/// Pure parser. Given HTML (from anywhere — direct, cloud, or a fixture
/// file) and the source URL, extract Amazon product detail. Returns a
/// `Value` rather than a typed struct so callers can pass it through
/// without carrying webclaw_fetch types.
pub fn parse(html: &str, url: &str, asin: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    // Three-tier title: JSON-LD `name` > Amazon's `#productTitle` span
    // (only present on real static HTML) > cloud-synthesized og:title.
    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| dom_title(html))
        .or_else(|| og(html, "title"));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| dom_image(html))
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);
    let offer = jsonld.as_ref().and_then(first_offer);

    let sku = jsonld.as_ref().and_then(|v| get_text(v, "sku"));
    let mpn = jsonld.as_ref().and_then(|v| get_text(v, "mpn"));

    json!({
        "url": url,
        "asin": asin,
        "title": title,
        "brand": brand,
        "description": description,
        "image": image,
        "price": offer.as_ref().and_then(|o| get_text(o, "price")),
        "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
        "availability": offer.as_ref().and_then(|o| {
            get_text(o, "availability").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "condition": offer.as_ref().and_then(|o| {
            get_text(o, "itemCondition").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "sku": sku,
        "mpn": mpn,
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_amazon_host(host: &str) -> bool {
    host.starts_with("www.amazon.") || host.starts_with("amazon.")
}

/// Pull a 10-char ASIN out of any recognised Amazon URL shape:
/// - /dp/{ASIN}
/// - /gp/product/{ASIN}
/// - /product/{ASIN}
/// - /exec/obidos/ASIN/{ASIN}
fn parse_asin(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r"/(?:dp|gp/product|product|ASIN)/([A-Z0-9]{10})(?:[/?#]|$)").unwrap()
    });
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

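The regex above encodes three constraints: a known path prefix, exactly ten `[A-Z0-9]` characters, and a separator (or end of string) right after. A dependency-free sketch of the same logic in plain std (the function name is illustrative; the real extractor uses the regex):

```rust
// Pure-std equivalent of the ASIN regex: known prefix, then ten
// uppercase-alphanumeric chars, then a path/query/fragment separator
// or end of string. Mirrors /(?:dp|gp/product|product|ASIN)/...
fn asin_of(url: &str) -> Option<String> {
    for prefix in ["/dp/", "/gp/product/", "/product/", "/ASIN/"] {
        if let Some(pos) = url.find(prefix) {
            let rest = &url[pos + prefix.len()..];
            let cand: String = rest.chars().take(10).collect();
            // ASINs are ASCII, so byte length == char count here.
            let ok_tail = rest[cand.len()..]
                .chars()
                .next()
                .map_or(true, |c| matches!(c, '/' | '?' | '#'));
            if cand.len() == 10
                && cand
                    .chars()
                    .all(|c| c.is_ascii_uppercase() || c.is_ascii_digit())
                && ok_tail
            {
                return Some(cand);
            }
        }
    }
    None
}

fn main() {
    assert_eq!(
        asin_of("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
        Some("B0CHX1W1XY".to_string())
    );
    assert_eq!(asin_of("https://www.amazon.com/gp/cart"), None);
    println!("ok");
}
```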
// ---------------------------------------------------------------------------
// JSON-LD walkers — light reuse of ecommerce_product's style
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

// ---------------------------------------------------------------------------
// DOM fallbacks — cheap regex for the two fields most likely to be
// missing from JSON-LD on Amazon.
// ---------------------------------------------------------------------------

fn dom_title(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"(?s)id="productTitle"[^>]*>([^<]+)<"#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().trim().to_string())
}

fn dom_image(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"id="landingImage"[^>]+src="([^"]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// OG meta tag lookup. Cloud-synthesized HTML ships these even when
/// JSON-LD and Amazon-DOM-IDs are both absent, so they're the last
/// line of defence for `title`, `image`, `description`.
fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| html_unescape(m.as_str()));
        }
    }
    None
}

/// Undo the synthesize_html attribute escaping for the few entities it
/// emits. Keeps us off a heavier HTML-entity dep. `&amp;` is decoded
/// last so a source text containing a literal entity survives intact.
fn html_unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_multi_locale() {
        assert!(matches("https://www.amazon.com/dp/B0CHX1W1XY"));
        assert!(matches("https://www.amazon.co.uk/dp/B0CHX1W1XY/"));
        assert!(matches("https://www.amazon.de/dp/B0CHX1W1XY?psc=1"));
        assert!(matches(
            "https://www.amazon.com/gp/product/B0CHX1W1XY/ref=foo"
        ));
    }

    #[test]
    fn rejects_non_product_urls() {
        assert!(!matches("https://www.amazon.com/"));
        assert!(!matches("https://www.amazon.com/gp/cart"));
        assert!(!matches("https://example.com/dp/B0CHX1W1XY"));
    }

    #[test]
    fn parse_asin_extracts_from_multiple_shapes() {
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY/"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/dp/B0CHX1W1XY?psc=1"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/gp/product/B0CHX1W1XY/ref=bar"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(
            parse_asin("https://www.amazon.com/exec/obidos/ASIN/B0CHX1W1XY/baz"),
            Some("B0CHX1W1XY".into())
        );
        assert_eq!(parse_asin("https://www.amazon.com/"), None);
    }

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        // Minimal Amazon-style fixture with a Product JSON-LD block.
        let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"ACME Widget","sku":"B0CHX1W1XY",
"brand":{"@type":"Brand","name":"ACME"},
"image":"https://m.media-amazon.com/images/I/abc.jpg",
"offers":{"@type":"Offer","price":"19.99","priceCurrency":"USD",
"availability":"https://schema.org/InStock"},
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.6","reviewCount":"1234"}}
</script>
</head><body></body></html>"##;
        let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
        assert_eq!(v["asin"], "B0CHX1W1XY");
        assert_eq!(v["title"], "ACME Widget");
        assert_eq!(v["brand"], "ACME");
        assert_eq!(v["price"], "19.99");
        assert_eq!(v["currency"], "USD");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["aggregate_rating"]["rating_value"], "4.6");
        assert_eq!(v["aggregate_rating"]["review_count"], "1234");
    }

    #[test]
    fn parse_falls_back_to_dom_when_jsonld_missing_fields() {
        let html = r#"
<html><body>
<span id="productTitle">Fallback Title</span>
<img id="landingImage" src="https://m.media-amazon.com/images/I/fallback.jpg" />
</body></html>
|
||||
"#;
|
||||
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
|
||||
assert_eq!(v["title"], "Fallback Title");
|
||||
assert_eq!(
|
||||
v["image"],
|
||||
"https://m.media-amazon.com/images/I/fallback.jpg"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_falls_back_to_og_meta_when_no_jsonld_no_dom() {
|
||||
// Shape we see from the cloud synthesize_html path: OG tags
|
||||
// only, no JSON-LD, no Amazon DOM IDs.
|
||||
let html = r##"<html><head>
|
||||
<meta property="og:title" content="Cloud-sourced MacBook Pro">
|
||||
<meta property="og:image" content="https://m.media-amazon.com/images/I/cloud.jpg">
|
||||
<meta property="og:description" content="Via api.webclaw.io">
|
||||
</head></html>"##;
|
||||
let v = parse(html, "https://www.amazon.com/dp/B0CHX1W1XY", "B0CHX1W1XY");
|
||||
assert_eq!(v["title"], "Cloud-sourced MacBook Pro");
|
||||
assert_eq!(v["image"], "https://m.media-amazon.com/images/I/cloud.jpg");
|
||||
assert_eq!(v["description"], "Via api.webclaw.io");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn og_unescape_handles_quot_entity() {
|
||||
let html = r#"<meta property="og:title" content="Apple "M2 Pro" Laptop">"#;
|
||||
assert_eq!(
|
||||
og(html, "title").as_deref(),
|
||||
Some(r#"Apple "M2 Pro" Laptop"#)
|
||||
);
|
||||
}
|
||||
}
|
||||
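A note on entity-unescape ordering: decoding `&amp;` before the other entities double-unescapes doubly-escaped input. This std-only sketch (a standalone `unescape`, not the crate's helper) shows the safe order:

```rust
// Standalone sketch of minimal entity unescaping. `&amp;` is decoded
// last so that doubly-escaped input ("&amp;lt;") yields the once-escaped
// text ("&lt;") instead of being collapsed all the way to "<".
fn unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

fn main() {
    assert_eq!(unescape("&amp;lt;"), "&lt;");
    assert_eq!(unescape("a &quot;b&quot; &amp; c"), "a \"b\" & c");
    println!("ok");
}
```

Swapping the `&amp;` replacement to the front would turn `&amp;lt;` into `<`, silently corrupting titles that legitimately contain escaped markup.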
314  crates/webclaw-fetch/src/extractors/arxiv.rs  Normal file
@@ -0,0 +1,314 @@
//! ArXiv paper structured extractor.
//!
//! Uses the public ArXiv API at `export.arxiv.org/api/query?id_list={id}`
//! which returns Atom XML. We parse just enough to surface title, authors,
//! abstract, categories, and the canonical PDF link. No HTML scraping
//! required and no auth.

use quick_xml::Reader;
use quick_xml::events::Event;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "arxiv",
    label: "ArXiv paper",
    description: "Returns paper metadata: title, authors, abstract, categories, primary category, PDF URL.",
    url_patterns: &[
        "https://arxiv.org/abs/{id}",
        "https://arxiv.org/abs/{id}v{n}",
        "https://arxiv.org/pdf/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "arxiv.org" && host != "www.arxiv.org" {
        return false;
    }
    url.contains("/abs/") || url.contains("/pdf/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_id(url)
        .ok_or_else(|| FetchError::Build(format!("arxiv: cannot parse id from '{url}'")))?;

    let api_url = format!("https://export.arxiv.org/api/query?id_list={id}");
    let resp = client.fetch(&api_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "arxiv api returned status {}",
            resp.status
        )));
    }

    let entry = parse_atom_entry(&resp.html)
        .ok_or_else(|| FetchError::BodyDecode("arxiv: no <entry> in response".into()))?;
    if entry.title.is_none() && entry.summary.is_none() {
        return Err(FetchError::BodyDecode(format!(
            "arxiv: paper '{id}' returned empty entry (likely withdrawn or invalid id)"
        )));
    }

    Ok(json!({
        "url": url,
        "id": id,
        "arxiv_id": entry.id,
        "title": entry.title,
        "authors": entry.authors,
        "abstract": entry.summary.map(|s| collapse_whitespace(&s)),
        "published": entry.published,
        "updated": entry.updated,
        "primary_category": entry.primary_category,
        "categories": entry.categories,
        "doi": entry.doi,
        "comment": entry.comment,
        "pdf_url": entry.pdf_url,
        "abs_url": entry.abs_url,
    }))
}

// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse an arxiv id from a URL. Strips the version suffix (`v2`, `v3`)
/// and the `.pdf` extension when present.
fn parse_id(url: &str) -> Option<String> {
    let after = url
        .split("/abs/")
        .nth(1)
        .or_else(|| url.split("/pdf/").nth(1))?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .trim_end_matches(".pdf");
    // Strip optional version suffix, e.g. "2401.12345v2" → "2401.12345"
    let no_version = match stripped.rfind('v') {
        Some(i) if stripped[i + 1..].chars().all(|c| c.is_ascii_digit()) => &stripped[..i],
        _ => stripped,
    };
    if no_version.is_empty() {
        None
    } else {
        Some(no_version.to_string())
    }
}

fn collapse_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

#[derive(Default)]
struct AtomEntry {
    id: Option<String>,
    title: Option<String>,
    summary: Option<String>,
    published: Option<String>,
    updated: Option<String>,
    primary_category: Option<String>,
    categories: Vec<String>,
    authors: Vec<String>,
    doi: Option<String>,
    comment: Option<String>,
    pdf_url: Option<String>,
    abs_url: Option<String>,
}

/// Parse the first `<entry>` block of an ArXiv Atom feed.
fn parse_atom_entry(xml: &str) -> Option<AtomEntry> {
    let mut reader = Reader::from_str(xml);
    let mut buf = Vec::new();

    // States
    let mut in_entry = false;
    let mut current: Option<&'static str> = None;
    let mut in_author = false;
    let mut in_author_name = false;
    let mut entry = AtomEntry::default();

    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(ref e)) => {
                let local = e.local_name();
                match local.as_ref() {
                    b"entry" => in_entry = true,
                    b"id" if in_entry && !in_author => current = Some("id"),
                    b"title" if in_entry => current = Some("title"),
                    b"summary" if in_entry => current = Some("summary"),
                    b"published" if in_entry => current = Some("published"),
                    b"updated" if in_entry => current = Some("updated"),
                    b"author" if in_entry => in_author = true,
                    b"name" if in_author => {
                        in_author_name = true;
                        current = Some("author_name");
                    }
                    b"category" if in_entry => {
                        // primary_category is namespaced (arxiv:primary_category),
                        // category is plain. quick-xml gives us local-name only,
                        // so we treat both as categories and take the first as
                        // primary.
                        for attr in e.attributes().flatten() {
                            if attr.key.as_ref() == b"term"
                                && let Ok(v) = attr.unescape_value()
                            {
                                let term = v.to_string();
                                if entry.primary_category.is_none() {
                                    entry.primary_category = Some(term.clone());
                                }
                                entry.categories.push(term);
                            }
                        }
                    }
                    b"link" if in_entry => {
                        let mut href = None;
                        let mut rel = None;
                        let mut typ = None;
                        for attr in e.attributes().flatten() {
                            match attr.key.as_ref() {
                                b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
                                b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
                                b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
                                _ => {}
                            }
                        }
                        if let Some(h) = href {
                            if typ.as_deref() == Some("application/pdf") {
                                entry.pdf_url = Some(h.clone());
                            }
                            if rel.as_deref() == Some("alternate") {
                                entry.abs_url = Some(h);
                            }
                        }
                    }
                    _ => current = None,
                }
            }
            Ok(Event::Empty(ref e)) => {
                // Self-closing tags (<link href="..." />). Same handling as Start.
                let local = e.local_name();
                if (local.as_ref() == b"link" || local.as_ref() == b"category") && in_entry {
                    let mut href = None;
                    let mut rel = None;
                    let mut typ = None;
                    let mut term = None;
                    for attr in e.attributes().flatten() {
                        match attr.key.as_ref() {
                            b"href" => href = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"rel" => rel = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"type" => typ = attr.unescape_value().ok().map(|s| s.to_string()),
                            b"term" => term = attr.unescape_value().ok().map(|s| s.to_string()),
                            _ => {}
                        }
                    }
                    if let Some(t) = term {
                        if entry.primary_category.is_none() {
                            entry.primary_category = Some(t.clone());
                        }
                        entry.categories.push(t);
                    }
                    if let Some(h) = href {
                        if typ.as_deref() == Some("application/pdf") {
                            entry.pdf_url = Some(h.clone());
                        }
                        if rel.as_deref() == Some("alternate") {
                            entry.abs_url = Some(h);
                        }
                    }
                }
            }
            Ok(Event::Text(ref e)) => {
                if let (Some(field), Ok(text)) = (current, e.unescape()) {
                    let text = text.to_string();
                    match field {
                        "id" => entry.id = Some(text.trim().to_string()),
                        "title" => entry.title = append_text(entry.title.take(), &text),
                        "summary" => entry.summary = append_text(entry.summary.take(), &text),
                        "published" => entry.published = Some(text.trim().to_string()),
                        "updated" => entry.updated = Some(text.trim().to_string()),
                        "author_name" => entry.authors.push(text.trim().to_string()),
                        _ => {}
                    }
                }
            }
            Ok(Event::End(ref e)) => {
                let local = e.local_name();
                match local.as_ref() {
                    b"entry" => break,
                    b"author" => in_author = false,
                    b"name" => in_author_name = false,
                    _ => {}
                }
                if !in_author_name {
                    current = None;
                }
            }
            Ok(Event::Eof) => break,
            Err(_) => return None,
            _ => {}
        }
        buf.clear();
    }

    if in_entry { Some(entry) } else { None }
}

/// Concatenate text fragments (long fields can be split across multiple
/// text events if they contain entities or CDATA).
fn append_text(prev: Option<String>, next: &str) -> Option<String> {
    match prev {
        Some(mut s) => {
            s.push_str(next);
            Some(s)
        }
        None => Some(next.to_string()),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_arxiv_urls() {
        assert!(matches("https://arxiv.org/abs/2401.12345"));
        assert!(matches("https://arxiv.org/abs/2401.12345v2"));
        assert!(matches("https://arxiv.org/pdf/2401.12345.pdf"));
        assert!(!matches("https://arxiv.org/"));
        assert!(!matches("https://example.com/abs/foo"));
    }

    #[test]
    fn parse_id_strips_version_and_extension() {
        assert_eq!(
            parse_id("https://arxiv.org/abs/2401.12345"),
            Some("2401.12345".into())
        );
        assert_eq!(
            parse_id("https://arxiv.org/abs/2401.12345v3"),
            Some("2401.12345".into())
        );
        assert_eq!(
            parse_id("https://arxiv.org/pdf/2401.12345v2.pdf"),
            Some("2401.12345".into())
        );
    }

    #[test]
    fn collapse_whitespace_handles_newlines_and_tabs() {
        assert_eq!(collapse_whitespace("a b\n\tc "), "a b c");
    }
}
168  crates/webclaw-fetch/src/extractors/crates_io.rs  Normal file
@@ -0,0 +1,168 @@
//! crates.io structured extractor.
//!
//! Uses the public JSON API at `crates.io/api/v1/crates/{name}`. No
//! auth, no rate limit at normal usage. The response includes both
//! the crate metadata and the full version list, which we summarize
//! down to a count + latest release info to keep the payload small.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "crates_io",
    label: "crates.io package",
    description: "Returns crate metadata: latest version, dependencies, downloads, license, repository.",
    url_patterns: &[
        "https://crates.io/crates/{name}",
        "https://crates.io/crates/{name}/{version}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "crates.io" && host != "www.crates.io" {
        return false;
    }
    url.contains("/crates/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let name = parse_name(url)
        .ok_or_else(|| FetchError::Build(format!("crates.io: cannot parse name from '{url}'")))?;

    let api_url = format!("https://crates.io/api/v1/crates/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "crates.io: crate '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "crates.io api returned status {}",
            resp.status
        )));
    }

    let body: CratesResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("crates.io parse: {e}")))?;

    let c = body.crate_;
    let latest_version = body
        .versions
        .iter()
        .find(|v| !v.yanked.unwrap_or(false))
        .or_else(|| body.versions.first());

    Ok(json!({
        "url": url,
        "name": c.id,
        "description": c.description,
        "homepage": c.homepage,
        "documentation": c.documentation,
        "repository": c.repository,
        "max_stable_version": c.max_stable_version,
        "max_version": c.max_version,
        "newest_version": c.newest_version,
        "downloads": c.downloads,
        "recent_downloads": c.recent_downloads,
        "categories": c.categories,
        "keywords": c.keywords,
        "release_count": body.versions.len(),
        "latest_release_date": latest_version.and_then(|v| v.created_at.clone()),
        "latest_license": latest_version.and_then(|v| v.license.clone()),
        "latest_rust_version": latest_version.and_then(|v| v.rust_version.clone()),
        "latest_yanked": latest_version.and_then(|v| v.yanked),
        "created_at": c.created_at,
        "updated_at": c.updated_at,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_name(url: &str) -> Option<String> {
    let after = url.split("/crates/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let first = stripped.split('/').find(|s| !s.is_empty())?;
    Some(first.to_string())
}

// ---------------------------------------------------------------------------
// crates.io API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct CratesResponse {
    #[serde(rename = "crate")]
    crate_: CrateInfo,
    #[serde(default)]
    versions: Vec<VersionInfo>,
}

#[derive(Deserialize)]
struct CrateInfo {
    id: Option<String>,
    description: Option<String>,
    homepage: Option<String>,
    documentation: Option<String>,
    repository: Option<String>,
    max_stable_version: Option<String>,
    max_version: Option<String>,
    newest_version: Option<String>,
    downloads: Option<i64>,
    recent_downloads: Option<i64>,
    #[serde(default)]
    categories: Vec<String>,
    #[serde(default)]
    keywords: Vec<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
}

#[derive(Deserialize)]
struct VersionInfo {
    license: Option<String>,
    rust_version: Option<String>,
    yanked: Option<bool>,
    created_at: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_crate_pages() {
        assert!(matches("https://crates.io/crates/serde"));
        assert!(matches("https://crates.io/crates/tokio/1.45.0"));
        assert!(!matches("https://crates.io/"));
        assert!(!matches("https://example.com/crates/foo"));
    }

    #[test]
    fn parse_name_handles_versioned_urls() {
        assert_eq!(
            parse_name("https://crates.io/crates/serde"),
            Some("serde".into())
        );
        assert_eq!(
            parse_name("https://crates.io/crates/tokio/1.45.0"),
            Some("tokio".into())
        );
        assert_eq!(
            parse_name("https://crates.io/crates/scraper/?foo=bar"),
            Some("scraper".into())
        );
    }
}
188  crates/webclaw-fetch/src/extractors/dev_to.rs  Normal file
@@ -0,0 +1,188 @@
//! dev.to article structured extractor.
//!
//! `dev.to/api/articles/{username}/{slug}` returns the full article body,
//! tags, reaction count, comment count, and reading time. Anonymous
//! access works fine for published posts.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "dev_to",
    label: "dev.to article",
    description: "Returns article metadata + body: title, body markdown, tags, reactions, comments, reading time.",
    url_patterns: &["https://dev.to/{username}/{slug}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "dev.to" && host != "www.dev.to" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // Need exactly /{username}/{slug}, with a non-reserved first segment.
    segs.len() == 2 && !RESERVED_FIRST_SEGS.contains(&segs[0])
}

const RESERVED_FIRST_SEGS: &[&str] = &[
    "api",
    "tags",
    "search",
    "settings",
    "enter",
    "signup",
    "about",
    "code-of-conduct",
    "privacy",
    "terms",
    "contact",
    "sponsorships",
    "sponsors",
    "shop",
    "videos",
    "listings",
    "podcasts",
    "p",
    "t",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (username, slug) = parse_username_slug(url).ok_or_else(|| {
        FetchError::Build(format!("dev_to: cannot parse username/slug from '{url}'"))
    })?;

    let api_url = format!("https://dev.to/api/articles/{username}/{slug}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "dev_to: article '{username}/{slug}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "dev.to api returned status {}",
            resp.status
        )));
    }

    let a: Article = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("dev.to parse: {e}")))?;

    Ok(json!({
        "url": url,
        "id": a.id,
        "title": a.title,
        "description": a.description,
        "body_markdown": a.body_markdown,
        "url_canonical": a.canonical_url,
        "published_at": a.published_at,
        "edited_at": a.edited_at,
        "reading_time_min": a.reading_time_minutes,
        "tags": a.tag_list,
        "positive_reactions": a.positive_reactions_count,
        "public_reactions": a.public_reactions_count,
        "comments_count": a.comments_count,
        "page_views_count": a.page_views_count,
        "cover_image": a.cover_image,
        "author": json!({
            "username": a.user.as_ref().and_then(|u| u.username.clone()),
            "name": a.user.as_ref().and_then(|u| u.name.clone()),
            "twitter": a.user.as_ref().and_then(|u| u.twitter_username.clone()),
            "github": a.user.as_ref().and_then(|u| u.github_username.clone()),
            "website": a.user.as_ref().and_then(|u| u.website_url.clone()),
        }),
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_username_slug(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let username = segs.next()?;
    let slug = segs.next()?;
    Some((username.to_string(), slug.to_string()))
}

// ---------------------------------------------------------------------------
// dev.to API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Article {
    id: Option<i64>,
    title: Option<String>,
    description: Option<String>,
    body_markdown: Option<String>,
    canonical_url: Option<String>,
    published_at: Option<String>,
    edited_at: Option<String>,
    reading_time_minutes: Option<i64>,
    tag_list: Option<serde_json::Value>, // string OR array depending on endpoint
    positive_reactions_count: Option<i64>,
    public_reactions_count: Option<i64>,
    comments_count: Option<i64>,
    page_views_count: Option<i64>,
    cover_image: Option<String>,
    user: Option<UserRef>,
}

#[derive(Deserialize)]
struct UserRef {
    username: Option<String>,
    name: Option<String>,
    twitter_username: Option<String>,
    github_username: Option<String>,
    website_url: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_article_urls() {
        assert!(matches("https://dev.to/ben/welcome-thread"));
        assert!(matches("https://dev.to/0xmassi/some-post-1abc"));
        assert!(!matches("https://dev.to/"));
        assert!(!matches("https://dev.to/api/articles/foo/bar"));
        assert!(!matches("https://dev.to/tags/rust"));
        assert!(!matches("https://dev.to/ben")); // user profile, not an article
        assert!(!matches("https://example.com/ben/post"));
    }

    #[test]
    fn parse_pulls_username_and_slug() {
        assert_eq!(
            parse_username_slug("https://dev.to/ben/welcome-thread"),
            Some(("ben".into(), "welcome-thread".into()))
        );
        assert_eq!(
            parse_username_slug("https://dev.to/0xmassi/some-post-1abc/?foo=bar"),
            Some(("0xmassi".into(), "some-post-1abc".into()))
        );
    }
}
150  crates/webclaw-fetch/src/extractors/docker_hub.rs  Normal file
@@ -0,0 +1,150 @@
//! Docker Hub repository structured extractor.
//!
//! Uses the v2 JSON API at `hub.docker.com/v2/repositories/{namespace}/{name}`.
//! Anonymous access is allowed for public images. The official-image
//! shorthand (e.g. `nginx`, `redis`) is normalized to `library/{name}`.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "docker_hub",
    label: "Docker Hub repository",
    description: "Returns image metadata: pull count, star count, last_updated, official flag, description.",
    url_patterns: &[
        "https://hub.docker.com/_/{name}",
        "https://hub.docker.com/r/{namespace}/{name}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "hub.docker.com" {
        return false;
    }
    url.contains("/_/") || url.contains("/r/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (namespace, name) = parse_repo(url)
        .ok_or_else(|| FetchError::Build(format!("docker_hub: cannot parse repo from '{url}'")))?;

    let api_url = format!("https://hub.docker.com/v2/repositories/{namespace}/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "docker_hub: repo '{namespace}/{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "docker_hub api returned status {}",
            resp.status
        )));
    }

    let r: RepoResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("docker_hub parse: {e}")))?;

    Ok(json!({
        "url": url,
        "namespace": r.namespace,
        "name": r.name,
        "full_name": format!("{namespace}/{name}"),
        "pull_count": r.pull_count,
        "star_count": r.star_count,
        "description": r.description,
        "full_description": r.full_description,
        "last_updated": r.last_updated,
        "date_registered": r.date_registered,
        "is_official": namespace == "library",
        "is_private": r.is_private,
        "status_description": r.status_description,
        "categories": r.categories,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse `(namespace, name)` from a Docker Hub URL. The official-image
/// shorthand `/_/nginx` maps to `(library, nginx)`. Personal repos
/// `/r/foo/bar` map to `(foo, bar)`.
fn parse_repo(url: &str) -> Option<(String, String)> {
    if let Some(after) = url.split("/_/").nth(1) {
        let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
        let name = stripped.split('/').next().filter(|s| !s.is_empty())?;
        return Some(("library".into(), name.to_string()));
    }
    let after = url.split("/r/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let ns = segs.next()?;
    let nm = segs.next()?;
    Some((ns.to_string(), nm.to_string()))
}

#[derive(Deserialize)]
struct RepoResponse {
    namespace: Option<String>,
    name: Option<String>,
    pull_count: Option<i64>,
    star_count: Option<i64>,
    description: Option<String>,
    full_description: Option<String>,
    last_updated: Option<String>,
    date_registered: Option<String>,
    is_private: Option<bool>,
    status_description: Option<String>,
    #[serde(default)]
    categories: Vec<DockerCategory>,
}

#[derive(Deserialize, serde::Serialize)]
struct DockerCategory {
    name: Option<String>,
    slug: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_docker_urls() {
        assert!(matches("https://hub.docker.com/_/nginx"));
        assert!(matches("https://hub.docker.com/r/grafana/grafana"));
        assert!(!matches("https://hub.docker.com/"));
        assert!(!matches("https://example.com/_/nginx"));
    }

    #[test]
    fn parse_repo_handles_official_and_personal() {
        assert_eq!(
            parse_repo("https://hub.docker.com/_/nginx"),
            Some(("library".into(), "nginx".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/_/nginx/tags"),
            Some(("library".into(), "nginx".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/r/grafana/grafana"),
            Some(("grafana".into(), "grafana".into()))
        );
        assert_eq!(
            parse_repo("https://hub.docker.com/r/grafana/grafana/?foo=bar"),
            Some(("grafana".into(), "grafana".into()))
        );
    }
}
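The `/_/{name}` vs `/r/{namespace}/{name}` normalization that `parse_repo` performs can be condensed into a standalone sketch (std-only, independent of the crate's function):

```rust
// "/_/nginx" is the official-image shorthand for "library/nginx";
// "/r/foo/bar" is a personal repo. Query strings, fragments, and
// trailing path segments such as "/tags" are ignored.
fn repo(url: &str) -> Option<(String, String)> {
    if let Some(after) = url.split("/_/").nth(1) {
        let name = after.split(['?', '#', '/']).next().filter(|s| !s.is_empty())?;
        return Some(("library".to_string(), name.to_string()));
    }
    let path = url.split("/r/").nth(1)?.split(['?', '#']).next()?;
    let mut segs = path.split('/').filter(|s| !s.is_empty());
    Some((segs.next()?.to_string(), segs.next()?.to_string()))
}

fn main() {
    assert_eq!(
        repo("https://hub.docker.com/_/nginx/tags"),
        Some(("library".to_string(), "nginx".to_string()))
    );
    assert_eq!(
        repo("https://hub.docker.com/r/grafana/grafana?tab=tags"),
        Some(("grafana".to_string(), "grafana".to_string()))
    );
    println!("ok");
}
```

Normalizing to `(namespace, name)` up front keeps the API call a single `format!` regardless of which URL shape the user pasted.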
337  crates/webclaw-fetch/src/extractors/ebay_listing.rs  Normal file
@@ -0,0 +1,337 @@
//! eBay listing extractor.
//!
//! eBay item pages at `ebay.com/itm/{id}` and international variants
//! usually ship a `Product` JSON-LD block with title, price, currency,
//! condition, and an `AggregateOffer` when bidding. eBay applies
//! Cloudflare + custom WAF selectively — some item IDs return normal
//! HTML to the Firefox profile, others 403 / get the "Pardon our
//! interruption" page. We route through `cloud::smart_fetch_html` so
//! both paths resolve to the same parser.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "ebay_listing",
    label: "eBay listing",
    description: "Returns item title, price, currency, condition, seller, shipping, and bid info. Heavy listings may need WEBCLAW_API_KEY for antibot.",
    url_patterns: &[
        "https://www.ebay.com/itm/{id}",
        "https://www.ebay.co.uk/itm/{id}",
        "https://www.ebay.de/itm/{id}",
        "https://www.ebay.fr/itm/{id}",
        "https://www.ebay.it/itm/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_ebay_host(host) {
        return false;
    }
    parse_item_id(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let item_id = parse_item_id(url)
        .ok_or_else(|| FetchError::Build(format!("ebay_listing: no item id in '{url}'")))?;

    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url, &item_id);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

pub fn parse(html: &str, url: &str, item_id: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| og(html, "title"));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let offer = jsonld.as_ref().and_then(first_offer);

    // eBay's AggregateOffer uses lowPrice/highPrice. Offer uses price.
    let (low_price, high_price, single_price) = match offer.as_ref() {
        Some(o) => (
            get_text(o, "lowPrice"),
            get_text(o, "highPrice"),
            get_text(o, "price"),
        ),
        None => (None, None, None),
    };
    let offer_count = offer.as_ref().and_then(|o| get_text(o, "offerCount"));

    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);

    json!({
        "url": url,
        "item_id": item_id,
        "title": title,
        "brand": brand,
        "description": description,
        "image": image,
        "price": single_price,
        "low_price": low_price,
        "high_price": high_price,
        "offer_count": offer_count,
        "currency": offer.as_ref().and_then(|o| get_text(o, "priceCurrency")),
        "availability": offer.as_ref().and_then(|o| {
            get_text(o, "availability").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "condition": offer.as_ref().and_then(|o| {
            get_text(o, "itemCondition").map(|s|
                s.replace("http://schema.org/", "").replace("https://schema.org/", ""))
        }),
        "seller": offer.as_ref().and_then(|o|
            o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from)),
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_ebay_host(host: &str) -> bool {
    host.starts_with("www.ebay.") || host.starts_with("ebay.")
}

/// Pull the numeric item id out of `/itm/{id}` or `/itm/{slug}/{id}`
/// URLs. IDs are 10-15 digits today, but we accept any trailing segment
/// of 8+ digits so the extractor stays forward-compatible.
fn parse_item_id(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        // /itm/(optional-slug/)?(digits)([/?#]|end)
        Regex::new(r"/itm/(?:[^/]+/)?(\d{8,})(?:[/?#]|$)").unwrap()
    });
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_ebay_item_urls() {
        assert!(matches("https://www.ebay.com/itm/325478156234"));
        assert!(matches(
            "https://www.ebay.com/itm/vintage-typewriter/325478156234"
        ));
        assert!(matches("https://www.ebay.co.uk/itm/325478156234"));
        assert!(!matches("https://www.ebay.com/"));
        assert!(!matches("https://www.ebay.com/sch/foo"));
        assert!(!matches("https://example.com/itm/325478156234"));
    }

    #[test]
    fn parse_item_id_handles_slugged_urls() {
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/325478156234"),
            Some("325478156234".into())
        );
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/vintage-typewriter/325478156234"),
            Some("325478156234".into())
        );
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/325478156234?hash=abc"),
            Some("325478156234".into())
        );
    }
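
    // Added sketch (not in the original source): pins down the regex's
    // 8-digit floor, assuming short numeric or non-numeric trailing
    // segments should be rejected rather than treated as item ids.
    #[test]
    fn parse_item_id_rejects_short_or_missing_ids() {
        assert_eq!(parse_item_id("https://www.ebay.com/itm/1234567"), None);
        assert_eq!(
            parse_item_id("https://www.ebay.com/itm/vintage-typewriter"),
            None
        );
    }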

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        let html = r##"
<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Vintage Typewriter","sku":"TW-001",
"brand":{"@type":"Brand","name":"Olivetti"},
"image":"https://i.ebayimg.com/images/abc.jpg",
"offers":{"@type":"Offer","price":"79.99","priceCurrency":"GBP",
"availability":"https://schema.org/InStock",
"itemCondition":"https://schema.org/UsedCondition",
"seller":{"@type":"Person","name":"vintage_seller_99"}}}
</script>
</head></html>"##;
        let v = parse(html, "https://www.ebay.co.uk/itm/325", "325");
        assert_eq!(v["title"], "Vintage Typewriter");
        assert_eq!(v["price"], "79.99");
        assert_eq!(v["currency"], "GBP");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["condition"], "UsedCondition");
        assert_eq!(v["seller"], "vintage_seller_99");
        assert_eq!(v["brand"], "Olivetti");
    }

    #[test]
    fn parse_handles_aggregate_offer_price_range() {
        let html = r##"
<script type="application/ld+json">
{"@type":"Product","name":"Used Copies",
"offers":{"@type":"AggregateOffer","offerCount":"5",
"lowPrice":"10.00","highPrice":"50.00","priceCurrency":"USD"}}
</script>
"##;
        let v = parse(html, "https://www.ebay.com/itm/1", "1");
        assert_eq!(v["low_price"], "10.00");
        assert_eq!(v["high_price"], "50.00");
        assert_eq!(v["offer_count"], "5");
        assert_eq!(v["currency"], "USD");
    }
}
553
crates/webclaw-fetch/src/extractors/ecommerce_product.rs
Normal file
@@ -0,0 +1,553 @@
//! Generic ecommerce product extractor via Schema.org JSON-LD.
//!
//! Every modern ecommerce site ships a `<script type="application/ld+json">`
//! Product block for SEO / rich-result snippets. Google's own SEO docs
//! force this markup on anyone who wants to appear in shopping search.
//! We take advantage of it: one extractor that works on Shopify,
//! BigCommerce, WooCommerce, Squarespace, Magento, custom storefronts,
//! and anything else that follows Schema.org.
//!
//! **Explicit-call only** (`/v1/scrape/ecommerce_product`). Not in the
//! auto-dispatch because we can't identify "this is a product page"
//! from the URL alone. When the caller knows they have a product URL,
//! this is the reliable fallback for stores where shopify_product
//! doesn't apply.
//!
//! The extractor reuses `webclaw_core::structured_data::extract_json_ld`
//! so JSON-LD parsing is shared with the rest of the extraction
//! pipeline. We walk all blocks (including `@graph` and bare-array
//! wrappers) looking for `@type: Product`, `ProductGroup`, or
//! `IndividualProduct`.
//!
//! ## OG fallback
//!
//! Two real-world cases JSON-LD alone can't cover:
//!
//! 1. Site has no Product JSON-LD at all (smaller Squarespace / custom
//!    storefronts, many European shops).
//! 2. Site has Product JSON-LD but the `offers` block is empty (seen on
//!    Patagonia and other catalog-style sites that split price onto a
//!    separate widget).
//!
//! For case 1 we build a minimal payload from OG / product meta tags
//! (`og:title`, `og:image`, `og:description`, `product:price:amount`,
//! `product:price:currency`, `product:availability`, `product:brand`).
//! For case 2 we augment the JSON-LD offers list with an OG-derived
//! offer so callers get a price either way. A `data_source` field
//! (`"jsonld"` / `"jsonld+og"` / `"og_fallback"`) tells the caller
//! which branch produced the data.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "ecommerce_product",
    label: "Ecommerce product (generic)",
    description: "Returns product info from any site that ships Schema.org Product JSON-LD: name, description, images, brand, SKU, price, availability, aggregate rating.",
    url_patterns: &[
        "https://{any-ecom-store}/products/{slug}",
        "https://{any-ecom-store}/product/{slug}",
        "https://{any-ecom-store}/p/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    // Maximally permissive: explicit-call-only extractor. We trust the
    // caller knows they're pointing at a product page. Custom ecom
    // sites use every conceivable URL shape (warbyparker.com uses
    // `/eyeglasses/{category}/{slug}/{colour}`, etc.), so path-pattern
    // matching would false-negative a lot. All we gate on is a valid
    // http(s) URL with a host.
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return false;
    }
    !host_of(url).is_empty()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let resp = client.fetch(url).await?;
    if !(200..300).contains(&resp.status) {
        return Err(FetchError::Build(format!(
            "ecommerce_product: status {} for {url}",
            resp.status
        )));
    }
    parse(&resp.html, url).ok_or_else(|| {
        FetchError::BodyDecode(format!(
            "ecommerce_product: no Schema.org Product JSON-LD and no OG product tags on {url}"
        ))
    })
}

/// Pure parser: try JSON-LD first, fall back to OG meta tags. Returns
/// `None` when neither path has enough to say "this is a product page".
pub fn parse(html: &str, url: &str) -> Option<Value> {
    // Reuse the core JSON-LD parser so we benefit from whatever
    // robustness it gains over time (handling @graph, arrays, etc.).
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    let product = find_product(&blocks);

    if let Some(p) = product {
        Some(build_jsonld_payload(&p, html, url))
    } else if has_og_product_signal(html) {
        Some(build_og_payload(html, url))
    } else {
        None
    }
}

/// Build the rich payload from a Product JSON-LD node. Augments the
/// `offers` array with an OG-derived offer when JSON-LD offers is empty
/// so callers get a price on sites like Patagonia.
fn build_jsonld_payload(product: &Value, html: &str, url: &str) -> Value {
    let mut offers = collect_offers(product);
    let mut data_source = "jsonld";
    if offers.is_empty()
        && let Some(og_offer) = build_og_offer(html)
    {
        offers.push(og_offer);
        data_source = "jsonld+og";
    }

    json!({
        "url": url,
        "data_source": data_source,
        "name": get_text(product, "name").or_else(|| og(html, "title")),
        "description": get_text(product, "description").or_else(|| og(html, "description")),
        "brand": get_brand(product).or_else(|| meta_property(html, "product:brand")),
        "sku": get_text(product, "sku"),
        "mpn": get_text(product, "mpn"),
        "gtin": get_text(product, "gtin")
            .or_else(|| get_text(product, "gtin13"))
            .or_else(|| get_text(product, "gtin12"))
            .or_else(|| get_text(product, "gtin8")),
        "product_id": get_text(product, "productID"),
        "category": get_text(product, "category"),
        "color": get_text(product, "color"),
        "material": get_text(product, "material"),
        "images": nonempty_or_og(collect_images(product), html),
        "offers": offers,
        "aggregate_rating": get_aggregate_rating(product),
        "review_count": get_review_count(product),
        "raw_schema_type": get_text(product, "@type"),
        "raw_jsonld": product.clone(),
    })
}

/// Build a minimal payload from OG / product meta tags. Used when a
/// page has no Product JSON-LD at all.
fn build_og_payload(html: &str, url: &str) -> Value {
    let offers = build_og_offer(html).map(|o| vec![o]).unwrap_or_default();
    let image = og(html, "image");
    let images: Vec<Value> = image.map(|i| vec![Value::String(i)]).unwrap_or_default();

    json!({
        "url": url,
        "data_source": "og_fallback",
        "name": og(html, "title"),
        "description": og(html, "description"),
        "brand": meta_property(html, "product:brand"),
        "sku": None::<String>,
        "mpn": None::<String>,
        "gtin": None::<String>,
        "product_id": None::<String>,
        "category": None::<String>,
        "color": None::<String>,
        "material": None::<String>,
        "images": images,
        "offers": offers,
        "aggregate_rating": Value::Null,
        "review_count": None::<String>,
        "raw_schema_type": None::<String>,
        "raw_jsonld": Value::Null,
    })
}

fn nonempty_or_og(imgs: Vec<Value>, html: &str) -> Vec<Value> {
    if !imgs.is_empty() {
        return imgs;
    }
    og(html, "image")
        .map(|s| vec![Value::String(s)])
        .unwrap_or_default()
}

// ---------------------------------------------------------------------------
// JSON-LD walkers
// ---------------------------------------------------------------------------

/// Recursively walk the JSON-LD blocks and return the first node whose
/// `@type` is Product, ProductGroup, IndividualProduct, or another
/// product-shaped Schema.org type (Vehicle, SomeProducts).
fn find_product(blocks: &[Value]) -> Option<Value> {
    for b in blocks {
        if let Some(found) = find_product_in(b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    // @graph: [ {...}, {...} ]
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    // Bare array wrapper
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let t = match v.get("@type") {
        Some(t) => t,
        None => return false,
    };
    let match_str = |s: &str| {
        matches!(
            s,
            "Product" | "ProductGroup" | "IndividualProduct" | "Vehicle" | "SomeProducts"
        )
    };
    match t {
        Value::String(s) => match_str(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(match_str)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    if let Some(obj) = brand.as_object()
        && let Some(n) = obj.get("name").and_then(|x| x.as_str())
    {
        return Some(n.to_string());
    }
    None
}

fn collect_images(v: &Value) -> Vec<Value> {
    match v.get("image") {
        Some(Value::String(s)) => vec![Value::String(s.clone())],
        Some(Value::Array(arr)) => arr
            .iter()
            .filter_map(|x| match x {
                Value::String(s) => Some(Value::String(s.clone())),
                Value::Object(_) => x.get("url").cloned(),
                _ => None,
            })
            .collect(),
        Some(Value::Object(o)) => o.get("url").cloned().into_iter().collect(),
        _ => Vec::new(),
    }
}

/// Normalise both bare Offer and AggregateOffer into a uniform array.
fn collect_offers(v: &Value) -> Vec<Value> {
    let offers = match v.get("offers") {
        Some(o) => o,
        None => return Vec::new(),
    };
    let collect_single = |o: &Value| -> Option<Value> {
        Some(json!({
            "price": get_text(o, "price"),
            "low_price": get_text(o, "lowPrice"),
            "high_price": get_text(o, "highPrice"),
            "currency": get_text(o, "priceCurrency"),
            "availability": get_text(o, "availability").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
            "item_condition": get_text(o, "itemCondition").map(|s| s.replace("http://schema.org/", "").replace("https://schema.org/", "")),
            "valid_until": get_text(o, "priceValidUntil"),
            "url": get_text(o, "url"),
            "seller": o.get("seller").and_then(|s| s.get("name")).and_then(|n| n.as_str()).map(String::from),
            "offer_count": get_text(o, "offerCount"),
        }))
    };
    match offers {
        Value::Array(arr) => arr.iter().filter_map(collect_single).collect(),
        Value::Object(_) => collect_single(offers).into_iter().collect(),
        _ => Vec::new(),
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "best_rating": get_text(r, "bestRating"),
        "worst_rating": get_text(r, "worstRating"),
        "rating_count": get_text(r, "ratingCount"),
        "review_count": get_text(r, "reviewCount"),
    }))
}

fn get_review_count(v: &Value) -> Option<String> {
    v.get("aggregateRating")
        .and_then(|r| get_text(r, "reviewCount"))
        .or_else(|| get_text(v, "reviewCount"))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

// ---------------------------------------------------------------------------
// OG / product meta-tag helpers
// ---------------------------------------------------------------------------

/// True when the HTML has enough OG / product meta tags to justify
/// building a fallback payload. A single `og:title` isn't enough on its
/// own — every blog post has that. We require either a product price
/// tag or at least an `og:type` of `product`/`og:product` to avoid
/// mis-classifying articles as products.
fn has_og_product_signal(html: &str) -> bool {
    let has_price = meta_property(html, "product:price:amount").is_some()
        || meta_property(html, "og:price:amount").is_some();
    if has_price {
        return true;
    }
    // `<meta property="og:type" content="product">` is the Schema.org OG
    // marker for product pages.
    let og_type = og(html, "type").unwrap_or_default().to_lowercase();
    matches!(og_type.as_str(), "product" | "og:product" | "product.item")
}

/// Build a single Offer-shaped Value from OG / product meta tags, or
/// `None` if there's no price info at all.
fn build_og_offer(html: &str) -> Option<Value> {
    let price = meta_property(html, "product:price:amount")
        .or_else(|| meta_property(html, "og:price:amount"));
    let currency = meta_property(html, "product:price:currency")
        .or_else(|| meta_property(html, "og:price:currency"));
    let availability = meta_property(html, "product:availability")
        .or_else(|| meta_property(html, "og:availability"));
    price.as_ref()?;
    Some(json!({
        "price": price,
        "low_price": None::<String>,
        "high_price": None::<String>,
        "currency": currency,
        "availability": availability,
        "item_condition": None::<String>,
        "valid_until": None::<String>,
        "url": None::<String>,
        "seller": None::<String>,
        "offer_count": None::<String>,
    }))
}

/// Pull the value of `<meta property="og:{prop}" content="...">`.
fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

/// Pull the value of any `<meta property="..." content="...">` tag.
/// Needed for namespaced OG variants like `product:price:amount` that
/// the simple `og:*` matcher above doesn't cover.
fn meta_property(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

#[cfg(test)]
mod tests {
    use super::*;
    use serde_json::json;

    #[test]
    fn matches_any_http_url_with_host() {
        assert!(matches("https://www.allbirds.com/products/tree-runner"));
        assert!(matches(
            "https://www.warbyparker.com/eyeglasses/women/percey/jet-black-with-polished-gold"
        ));
        assert!(matches("https://example.com/p/widget"));
        assert!(matches("http://shop.example.com/foo/bar"));
    }

    #[test]
    fn rejects_empty_or_non_http() {
        assert!(!matches(""));
        assert!(!matches("not-a-url"));
        assert!(!matches("ftp://example.com/file"));
    }

    #[test]
    fn find_product_walks_graph() {
        let block = json!({
            "@context": "https://schema.org",
            "@graph": [
                {"@type": "Organization", "name": "ACME"},
                {"@type": "Product", "name": "Widget", "sku": "ABC"}
            ]
        });
        let blocks = vec![block];
        let p = find_product(&blocks).unwrap();
        assert_eq!(p.get("name").and_then(|v| v.as_str()), Some("Widget"));
    }

    #[test]
    fn find_product_handles_array_type() {
        let block = json!({
            "@type": ["Product", "Clothing"],
            "name": "Tee"
        });
        assert!(is_product_type(&block));
    }

    #[test]
    fn get_brand_from_string_or_object() {
        assert_eq!(get_brand(&json!({"brand": "ACME"})), Some("ACME".into()));
        assert_eq!(
            get_brand(&json!({"brand": {"@type": "Brand", "name": "ACME"}})),
            Some("ACME".into())
        );
    }

    #[test]
    fn collect_offers_handles_single_and_aggregate() {
        let p = json!({
            "offers": {
                "@type": "Offer",
                "price": "19.99",
                "priceCurrency": "USD",
                "availability": "https://schema.org/InStock"
            }
        });
        let offers = collect_offers(&p);
        assert_eq!(offers.len(), 1);
        assert_eq!(
            offers[0].get("price").and_then(|v| v.as_str()),
            Some("19.99")
        );
        assert_eq!(
            offers[0].get("availability").and_then(|v| v.as_str()),
            Some("InStock")
        );
    }

    // --- OG fallback --------------------------------------------------------

    #[test]
    fn has_og_product_signal_accepts_product_type_or_price() {
        let type_only = r#"<meta property="og:type" content="product">"#;
        let price_only = r#"<meta property="product:price:amount" content="49.00">"#;
        let neither = r#"<meta property="og:title" content="My Article"><meta property="og:type" content="article">"#;
        assert!(has_og_product_signal(type_only));
        assert!(has_og_product_signal(price_only));
        assert!(!has_og_product_signal(neither));
    }

    #[test]
    fn og_fallback_builds_payload_without_jsonld() {
        let html = r##"<html><head>
<meta property="og:type" content="product">
<meta property="og:title" content="Handmade Candle">
<meta property="og:image" content="https://cdn.example.com/candle.jpg">
<meta property="og:description" content="Small-batch soy candle.">
<meta property="product:price:amount" content="18.00">
<meta property="product:price:currency" content="USD">
<meta property="product:availability" content="in stock">
<meta property="product:brand" content="Little Studio">
</head></html>"##;
        let v = parse(html, "https://example.com/p/candle").unwrap();
        assert_eq!(v["data_source"], "og_fallback");
        assert_eq!(v["name"], "Handmade Candle");
        assert_eq!(v["description"], "Small-batch soy candle.");
        assert_eq!(v["brand"], "Little Studio");
        assert_eq!(v["offers"][0]["price"], "18.00");
        assert_eq!(v["offers"][0]["currency"], "USD");
        assert_eq!(v["offers"][0]["availability"], "in stock");
        assert_eq!(v["images"][0], "https://cdn.example.com/candle.jpg");
    }

    #[test]
    fn jsonld_augments_empty_offers_with_og_price() {
        // Patagonia-shaped page: Product JSON-LD without an Offer, plus
        // product:price:* OG tags. We should merge.
        let html = r##"<html><head>
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"Product",
"name":"Better Sweater","brand":"Patagonia",
"aggregateRating":{"@type":"AggregateRating","ratingValue":"4.4","reviewCount":"1142"}}
</script>
<meta property="product:price:amount" content="139.00">
<meta property="product:price:currency" content="USD">
</head></html>"##;
        let v = parse(html, "https://patagonia.com/p/x").unwrap();
        assert_eq!(v["data_source"], "jsonld+og");
        assert_eq!(v["name"], "Better Sweater");
        assert_eq!(v["offers"].as_array().unwrap().len(), 1);
        assert_eq!(v["offers"][0]["price"], "139.00");
    }

    #[test]
    fn jsonld_only_stays_pure_jsonld() {
        let html = r##"<html><head>
<script type="application/ld+json">
{"@type":"Product","name":"Widget",
"offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
</script>
</head></html>"##;
        let v = parse(html, "https://example.com/p/w").unwrap();
        assert_eq!(v["data_source"], "jsonld");
        assert_eq!(v["offers"][0]["price"], "9.99");
    }

    #[test]
    fn parse_returns_none_on_no_product_signals() {
        let html = r#"<html><head>
<meta property="og:title" content="My Blog Post">
<meta property="og:type" content="article">
</head></html>"#;
        assert!(parse(html, "https://blog.example.com/post").is_none());
    }
|
||||
}
572	crates/webclaw-fetch/src/extractors/etsy_listing.rs	Normal file
@@ -0,0 +1,572 @@
//! Etsy listing extractor.
//!
//! Etsy product pages at `etsy.com/listing/{id}` (and a sluggy variant
//! `etsy.com/listing/{id}/{slug}`) ship a Schema.org `Product` JSON-LD
//! block with title, price, currency, availability, shop seller, and
//! an `AggregateRating` for the listing.
//!
//! Etsy puts Cloudflare + custom WAF in front of product pages with a
//! high variance: the Firefox profile gets clean HTML most of the time
//! but some listings return a CF interstitial. We route through
//! `cloud::smart_fetch_html` so both paths resolve to the same parser,
//! same as `ebay_listing`.
//!
//! ## URL slug as last-resort title
//!
//! Even with cloud antibot bypass, Etsy frequently serves a generic
//! page with minimal metadata (`og:title = "etsy.com"`, no JSON-LD,
//! empty markdown). In that case we humanise the slug from the URL
//! (`/listing/{id}/personalized-stainless-steel-tumbler` becomes
//! "Personalized Stainless Steel Tumbler") so callers always get a
//! meaningful title. Degrades gracefully when the URL has no slug.

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "etsy_listing",
    label: "Etsy listing",
    description: "Returns listing title, price, currency, availability, shop, rating, and image. Heavy listings may need WEBCLAW_API_KEY for antibot.",
    url_patterns: &[
        "https://www.etsy.com/listing/{id}",
        "https://www.etsy.com/listing/{id}/{slug}",
        "https://www.etsy.com/{locale}/listing/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !is_etsy_host(host) {
        return false;
    }
    parse_listing_id(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let listing_id = parse_listing_id(url)
        .ok_or_else(|| FetchError::Build(format!("etsy_listing: no listing id in '{url}'")))?;

    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url, &listing_id);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

pub fn parse(html: &str, url: &str, listing_id: &str) -> Value {
    let jsonld = find_product_jsonld(html);
    let slug_title = humanise_slug(parse_slug(url).as_deref());

    let title = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "name"))
        .or_else(|| og(html, "title").filter(|t| !is_generic_title(t)))
        .or(slug_title);
    let description = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description").filter(|d| !is_generic_description(d)));
    let image = jsonld
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let brand = jsonld.as_ref().and_then(get_brand);

    // Etsy listings often ship either a single Offer or an
    // AggregateOffer when the listing has variants with different prices.
    let offer = jsonld.as_ref().and_then(first_offer);
    let (low_price, high_price, single_price) = match offer.as_ref() {
        Some(o) => (
            get_text(o, "lowPrice"),
            get_text(o, "highPrice"),
            get_text(o, "price"),
        ),
        None => (None, None, None),
    };
    let currency = offer.as_ref().and_then(|o| get_text(o, "priceCurrency"));
    let availability = offer
        .as_ref()
        .and_then(|o| get_text(o, "availability").map(strip_schema_prefix));
    let item_condition = jsonld
        .as_ref()
        .and_then(|v| get_text(v, "itemCondition"))
        .map(strip_schema_prefix);

    // Shop name: offers[0].seller.name on newer listings, top-level
    // `brand` on older listings (Etsy changed the schema around 2022).
    // Fall back through both so either shape resolves.
    let shop = offer
        .as_ref()
        .and_then(|o| {
            o.get("seller")
                .and_then(|s| s.get("name"))
                .and_then(|n| n.as_str())
                .map(String::from)
        })
        .or_else(|| brand.clone());
    let shop_url = shop_url_from_html(html);

    let aggregate_rating = jsonld.as_ref().and_then(get_aggregate_rating);

    json!({
        "url": url,
        "listing_id": listing_id,
        "title": title,
        "description": description,
        "image": image,
        "brand": brand,
        "price": single_price,
        "low_price": low_price,
        "high_price": high_price,
        "currency": currency,
        "availability": availability,
        "item_condition": item_condition,
        "shop": shop,
        "shop_url": shop_url,
        "aggregate_rating": aggregate_rating,
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn is_etsy_host(host: &str) -> bool {
    host == "etsy.com" || host == "www.etsy.com" || host.ends_with(".etsy.com")
}

/// Extract the numeric listing id. Etsy ids are 9-11 digits today but
/// we accept any all-digit segment right after `/listing/`.
///
/// Handles `/listing/{id}`, `/listing/{id}/{slug}`, and the localised
/// `/{locale}/listing/{id}` shape (e.g. `/fr/listing/...`).
fn parse_listing_id(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"/listing/(\d{6,})(?:[/?#]|$)").unwrap());
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Extract the URL slug after the listing id, e.g.
/// `personalized-stainless-steel-tumbler`. Returns `None` when the URL
/// is the bare `/listing/{id}` shape.
fn parse_slug(url: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"/listing/\d{6,}/([^/?#]+)").unwrap());
    re.captures(url)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Turn a URL slug into a human-ish title:
/// `personalized-stainless-steel-tumbler` → `Personalized Stainless
/// Steel Tumbler`. Word-caps each dash-separated token and treats
/// underscores as spaces too. Returns `None` on empty input.
fn humanise_slug(slug: Option<&str>) -> Option<String> {
    let raw = slug?.trim();
    if raw.is_empty() {
        return None;
    }
    let words: Vec<String> = raw
        .split(['-', '_'])
        .filter(|w| !w.is_empty())
        .map(capitalise_word)
        .collect();
    if words.is_empty() {
        None
    } else {
        Some(words.join(" "))
    }
}

fn capitalise_word(w: &str) -> String {
    let mut chars = w.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}

/// True when the OG title is Etsy's fallback-page title rather than a
/// listing-specific title. Expired / region-blocked / antibot-filtered
/// pages return Etsy's sitewide tagline:
/// `"Etsy - Your place to buy and sell all things handmade..."`, or
/// simply `"etsy.com"`. A real listing title always starts with the
/// item name, never with "Etsy - " or the domain.
fn is_generic_title(t: &str) -> bool {
    let normalised = t.trim().to_lowercase();
    if matches!(
        normalised.as_str(),
        "etsy.com" | "etsy" | "www.etsy.com" | ""
    ) {
        return true;
    }
    // Etsy's sitewide marketing tagline, served on 404 / blocked pages.
    if normalised.starts_with("etsy - ")
        || normalised.starts_with("etsy.com - ")
        || normalised.starts_with("etsy uk - ")
    {
        return true;
    }
    // Etsy's "item unavailable" placeholder, served on delisted
    // products. Keep the slug fallback so callers still see what the
    // URL was about.
    normalised.starts_with("this item is unavailable")
        || normalised.starts_with("sorry, this item is")
        || normalised == "item not available - etsy"
}

/// True when the OG description is an Etsy error-page placeholder or
/// sitewide marketing blurb rather than a real listing description.
fn is_generic_description(d: &str) -> bool {
    let normalised = d.trim().to_lowercase();
    if normalised.is_empty() {
        return true;
    }
    normalised.starts_with("sorry, the page you were looking for")
        || normalised.starts_with("page not found")
        || normalised.starts_with("find the perfect handmade gift")
}

// ---------------------------------------------------------------------------
// JSON-LD walkers (same shape as ebay_listing; kept separate so the two
// extractors can diverge without cross-impact)
// ---------------------------------------------------------------------------

fn find_product_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_product_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_product_in(v: &Value) -> Option<Value> {
    if is_product_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_product_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_product_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_prod = |s: &str| matches!(s, "Product" | "ProductGroup" | "IndividualProduct");
    match t {
        Value::String(s) => is_prod(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_prod)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_brand(v: &Value) -> Option<String> {
    let brand = v.get("brand")?;
    if let Some(s) = brand.as_str() {
        return Some(s.to_string());
    }
    brand
        .as_object()
        .and_then(|o| o.get("name"))
        .and_then(|n| n.as_str())
        .map(String::from)
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn first_offer(v: &Value) -> Option<Value> {
    let offers = v.get("offers")?;
    match offers {
        Value::Array(arr) => arr.first().cloned(),
        Value::Object(_) => Some(offers.clone()),
        _ => None,
    }
}

fn get_aggregate_rating(v: &Value) -> Option<Value> {
    let r = v.get("aggregateRating")?;
    Some(json!({
        "rating_value": get_text(r, "ratingValue"),
        "review_count": get_text(r, "reviewCount"),
        "best_rating": get_text(r, "bestRating"),
    }))
}

fn strip_schema_prefix(s: String) -> String {
    s.replace("http://schema.org/", "")
        .replace("https://schema.org/", "")
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

/// Etsy links the owning shop with a canonical anchor like
/// `<a href="/shop/ShopName" ...>`. Grab the first one after the
/// breadcrumb boundary.
fn shop_url_from_html(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"href="(/shop/[A-Za-z0-9_-]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| format!("https://www.etsy.com{}", m.as_str()))
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_etsy_listing_urls() {
        assert!(matches("https://www.etsy.com/listing/123456789"));
        assert!(matches(
            "https://www.etsy.com/listing/123456789/vintage-typewriter"
        ));
        assert!(matches(
            "https://www.etsy.com/fr/listing/123456789/vintage-typewriter"
        ));
        assert!(!matches("https://www.etsy.com/"));
        assert!(!matches("https://www.etsy.com/shop/SomeShop"));
        assert!(!matches("https://example.com/listing/123456789"));
    }

    #[test]
    fn parse_listing_id_handles_slug_and_locale() {
        assert_eq!(
            parse_listing_id("https://www.etsy.com/listing/123456789"),
            Some("123456789".into())
        );
        assert_eq!(
            parse_listing_id("https://www.etsy.com/listing/123456789/slug-here"),
            Some("123456789".into())
        );
        assert_eq!(
            parse_listing_id("https://www.etsy.com/fr/listing/123456789/slug"),
            Some("123456789".into())
        );
        assert_eq!(
            parse_listing_id("https://www.etsy.com/listing/123456789?ref=foo"),
            Some("123456789".into())
        );
    }

    #[test]
    fn parse_extracts_from_fixture_jsonld() {
        let html = r##"
        <html><head>
        <script type="application/ld+json">
        {"@context":"https://schema.org","@type":"Product",
         "name":"Handmade Ceramic Mug","sku":"MUG-001",
         "brand":{"@type":"Brand","name":"Studio Clay"},
         "image":["https://i.etsystatic.com/abc.jpg","https://i.etsystatic.com/xyz.jpg"],
         "itemCondition":"https://schema.org/NewCondition",
         "offers":{"@type":"Offer","price":"24.00","priceCurrency":"USD",
          "availability":"https://schema.org/InStock",
          "seller":{"@type":"Organization","name":"StudioClay"}},
         "aggregateRating":{"@type":"AggregateRating","ratingValue":"4.9","reviewCount":"127","bestRating":"5"}}
        </script>
        <a href="/shop/StudioClay" class="wt-text-link">StudioClay</a>
        </head></html>"##;
        let v = parse(html, "https://www.etsy.com/listing/1", "1");
        assert_eq!(v["title"], "Handmade Ceramic Mug");
        assert_eq!(v["price"], "24.00");
        assert_eq!(v["currency"], "USD");
        assert_eq!(v["availability"], "InStock");
        assert_eq!(v["item_condition"], "NewCondition");
        assert_eq!(v["shop"], "StudioClay");
        assert_eq!(v["shop_url"], "https://www.etsy.com/shop/StudioClay");
        assert_eq!(v["brand"], "Studio Clay");
        assert_eq!(v["aggregate_rating"]["rating_value"], "4.9");
        assert_eq!(v["aggregate_rating"]["review_count"], "127");
    }

    #[test]
    fn parse_handles_aggregate_offer_price_range() {
        let html = r##"
        <script type="application/ld+json">
        {"@type":"Product","name":"Mug Set",
         "offers":{"@type":"AggregateOffer",
          "lowPrice":"18.00","highPrice":"36.00","priceCurrency":"USD"}}
        </script>
        "##;
        let v = parse(html, "https://www.etsy.com/listing/2", "2");
        assert_eq!(v["low_price"], "18.00");
        assert_eq!(v["high_price"], "36.00");
        assert_eq!(v["currency"], "USD");
    }

    #[test]
    fn parse_falls_back_to_og_when_no_jsonld() {
        let html = r#"
        <html><head>
        <meta property="og:title" content="Minimal Fallback Item">
        <meta property="og:description" content="OG-only extraction test.">
        <meta property="og:image" content="https://i.etsystatic.com/fallback.jpg">
        </head></html>"#;
        let v = parse(html, "https://www.etsy.com/listing/3", "3");
        assert_eq!(v["title"], "Minimal Fallback Item");
        assert_eq!(v["description"], "OG-only extraction test.");
        assert_eq!(v["image"], "https://i.etsystatic.com/fallback.jpg");
        // No price fields when we only have OG.
        assert!(v["price"].is_null());
    }

    #[test]
    fn parse_slug_from_url() {
        assert_eq!(
            parse_slug("https://www.etsy.com/listing/123456789/vintage-typewriter"),
            Some("vintage-typewriter".into())
        );
        assert_eq!(
            parse_slug("https://www.etsy.com/listing/123456789/slug?ref=shop"),
            Some("slug".into())
        );
        assert_eq!(parse_slug("https://www.etsy.com/listing/123456789"), None);
        assert_eq!(
            parse_slug("https://www.etsy.com/fr/listing/123456789/slug"),
            Some("slug".into())
        );
    }

    #[test]
    fn humanise_slug_capitalises_each_word() {
        assert_eq!(
            humanise_slug(Some("personalized-stainless-steel-tumbler")).as_deref(),
            Some("Personalized Stainless Steel Tumbler")
        );
        assert_eq!(
            humanise_slug(Some("hand_crafted_mug")).as_deref(),
            Some("Hand Crafted Mug")
        );
        assert_eq!(humanise_slug(Some("")), None);
        assert_eq!(humanise_slug(None), None);
    }

    #[test]
    fn is_generic_title_catches_common_shapes() {
        assert!(is_generic_title("etsy.com"));
        assert!(is_generic_title("Etsy"));
        assert!(is_generic_title(" etsy.com "));
        assert!(is_generic_title(
            "Etsy - Your place to buy and sell all things handmade, vintage, and supplies"
        ));
        assert!(is_generic_title("Etsy UK - Vintage & Handmade"));
        assert!(!is_generic_title("Vintage Typewriter"));
        assert!(!is_generic_title("Handmade Etsy-style Mug"));
    }

    #[test]
    fn is_generic_description_catches_404_shapes() {
        assert!(is_generic_description(""));
        assert!(is_generic_description(
            "Sorry, the page you were looking for was not found."
        ));
        assert!(is_generic_description("Page not found"));
        assert!(!is_generic_description(
            "Hand-thrown ceramic mug, dishwasher safe."
        ));
    }

    #[test]
    fn parse_uses_slug_when_og_is_generic() {
        // Cloud-blocked Etsy listing: og:title is a site-wide generic
        // placeholder, no JSON-LD, no description. Slug should win.
        let html = r#"<html><head>
            <meta property="og:title" content="etsy.com">
        </head></html>"#;
        let v = parse(
            html,
            "https://www.etsy.com/listing/1079113183/personalized-stainless-steel-tumbler",
            "1079113183",
        );
        assert_eq!(v["title"], "Personalized Stainless Steel Tumbler");
    }

    #[test]
    fn parse_prefers_real_og_over_slug() {
        let html = r#"<html><head>
            <meta property="og:title" content="Real Listing Title">
        </head></html>"#;
        let v = parse(
            html,
            "https://www.etsy.com/listing/1079113183/the-url-slug",
            "1079113183",
        );
        assert_eq!(v["title"], "Real Listing Title");
    }
}
172	crates/webclaw-fetch/src/extractors/github_issue.rs	Normal file
@@ -0,0 +1,172 @@
//! GitHub issue structured extractor.
//!
//! Mirror of `github_pr` but on `/issues/{number}`. Uses
//! `api.github.com/repos/{owner}/{repo}/issues/{number}`. Returns the
//! issue body + comment count + labels + milestone + author /
//! assignees. Full per-comment bodies would be another call; kept for
//! a follow-up.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_issue",
    label: "GitHub issue",
    description: "Returns issue metadata: title, body, state, author, labels, assignees, milestone, comment count.",
    url_patterns: &["https://github.com/{owner}/{repo}/issues/{number}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_issue(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, number) = parse_issue(url).ok_or_else(|| {
        FetchError::Build(format!("github_issue: cannot parse issue URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/issues/{number}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_issue: issue '{owner}/{repo}#{number}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_issue: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let issue: Issue = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github issue parse: {e}")))?;

    // The same endpoint returns PRs too; reject if we got one so the caller
    // uses /v1/scrape/github_pr instead of getting a half-shaped payload.
    if issue.pull_request.is_some() {
        return Err(FetchError::Build(format!(
            "github_issue: '{owner}/{repo}#{number}' is a pull request, use /v1/scrape/github_pr"
        )));
    }

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "number": issue.number,
        "title": issue.title,
        "body": issue.body,
        "state": issue.state,
        "state_reason": issue.state_reason,
        "author": issue.user.as_ref().and_then(|u| u.login.clone()),
        "labels": issue.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
        "assignees": issue.assignees.iter().filter_map(|u| u.login.clone()).collect::<Vec<_>>(),
        "milestone": issue.milestone.as_ref().and_then(|m| m.title.clone()),
        "comments": issue.comments,
        "locked": issue.locked,
        "created_at": issue.created_at,
        "updated_at": issue.updated_at,
        "closed_at": issue.closed_at,
        "html_url": issue.html_url,
    }))
}

fn parse_issue(url: &str) -> Option<(String, String, u64)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 4 || segs[2] != "issues" {
        return None;
    }
    let number: u64 = segs[3].parse().ok()?;
    Some((segs[0].to_string(), segs[1].to_string(), number))
}

// ---------------------------------------------------------------------------
// GitHub issue API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Issue {
    number: Option<i64>,
    title: Option<String>,
    body: Option<String>,
    state: Option<String>,
    state_reason: Option<String>,
    locked: Option<bool>,
    comments: Option<i64>,
    created_at: Option<String>,
    updated_at: Option<String>,
    closed_at: Option<String>,
    html_url: Option<String>,
    user: Option<UserRef>,
    #[serde(default)]
    labels: Vec<LabelRef>,
    #[serde(default)]
    assignees: Vec<UserRef>,
    milestone: Option<Milestone>,
    /// Present when this "issue" is actually a pull request. The REST
    /// API overloads the issues endpoint for PRs.
    pull_request: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct LabelRef {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Milestone {
    title: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_issue_urls() {
        assert!(matches("https://github.com/rust-lang/rust/issues/100"));
        assert!(matches("https://github.com/rust-lang/rust/issues/100/"));
        assert!(!matches("https://github.com/rust-lang/rust"));
        assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
        assert!(!matches("https://github.com/rust-lang/rust/issues"));
    }

    #[test]
    fn parse_issue_extracts_owner_repo_number() {
        assert_eq!(
            parse_issue("https://github.com/rust-lang/rust/issues/100"),
            Some(("rust-lang".into(), "rust".into(), 100))
        );
        assert_eq!(
            parse_issue("https://github.com/rust-lang/rust/issues/100/?foo=bar"),
            Some(("rust-lang".into(), "rust".into(), 100))
        );
    }
}
189	crates/webclaw-fetch/src/extractors/github_pr.rs	Normal file
@@ -0,0 +1,189 @@
//! GitHub pull request structured extractor.
//!
//! Uses `api.github.com/repos/{owner}/{repo}/pulls/{number}`. Returns
//! the PR metadata + a counted summary of comments and review activity.
//! Full diff and per-comment bodies require additional calls — left for
//! a follow-up enhancement so the v1 stays one network round-trip.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_pr",
    label: "GitHub pull request",
    description: "Returns PR metadata: title, body, state, author, labels, additions/deletions, file count.",
    url_patterns: &["https://github.com/{owner}/{repo}/pull/{number}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_pr(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, number) = parse_pr(url).ok_or_else(|| {
        FetchError::Build(format!("github_pr: cannot parse pull-request URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/pulls/{number}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_pr: pull request '{owner}/{repo}#{number}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_pr: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let p: PullRequest = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github pr parse: {e}")))?;

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "number": p.number,
        "title": p.title,
        "body": p.body,
        "state": p.state,
        "draft": p.draft,
        "merged": p.merged,
        "merged_at": p.merged_at,
        "merge_commit_sha": p.merge_commit_sha,
        "author": p.user.as_ref().and_then(|u| u.login.clone()),
        "labels": p.labels.iter().filter_map(|l| l.name.clone()).collect::<Vec<_>>(),
        "milestone": p.milestone.as_ref().and_then(|m| m.title.clone()),
        "head_ref": p.head.as_ref().and_then(|r| r.ref_name.clone()),
        "base_ref": p.base.as_ref().and_then(|r| r.ref_name.clone()),
        "head_sha": p.head.as_ref().and_then(|r| r.sha.clone()),
        "additions": p.additions,
        "deletions": p.deletions,
        "changed_files": p.changed_files,
        "commits": p.commits,
        "comments": p.comments,
        "review_comments": p.review_comments,
        "created_at": p.created_at,
        "updated_at": p.updated_at,
        "closed_at": p.closed_at,
        "html_url": p.html_url,
    }))
}

fn parse_pr(url: &str) -> Option<(String, String, u64)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{repo}/pull/{number} (or /pulls/{number} variant)
    if segs.len() < 4 {
        return None;
    }
    if segs[2] != "pull" && segs[2] != "pulls" {
        return None;
    }
    let number: u64 = segs[3].parse().ok()?;
    Some((segs[0].to_string(), segs[1].to_string(), number))
}

// ---------------------------------------------------------------------------
// GitHub PR API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PullRequest {
    number: Option<i64>,
    title: Option<String>,
    body: Option<String>,
    state: Option<String>,
    draft: Option<bool>,
    merged: Option<bool>,
    merged_at: Option<String>,
    merge_commit_sha: Option<String>,
    user: Option<UserRef>,
    #[serde(default)]
    labels: Vec<LabelRef>,
    milestone: Option<Milestone>,
    head: Option<GitRef>,
    base: Option<GitRef>,
    additions: Option<i64>,
    deletions: Option<i64>,
    changed_files: Option<i64>,
    commits: Option<i64>,
    comments: Option<i64>,
    review_comments: Option<i64>,
    created_at: Option<String>,
    updated_at: Option<String>,
    closed_at: Option<String>,
    html_url: Option<String>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct LabelRef {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Milestone {
    title: Option<String>,
}

#[derive(Deserialize)]
struct GitRef {
    #[serde(rename = "ref")]
    ref_name: Option<String>,
    sha: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

#[test]
|
||||
fn matches_pr_urls() {
|
||||
assert!(matches("https://github.com/rust-lang/rust/pull/12345"));
|
||||
assert!(matches(
|
||||
"https://github.com/rust-lang/rust/pull/12345/files"
|
||||
));
|
||||
assert!(!matches("https://github.com/rust-lang/rust"));
|
||||
assert!(!matches("https://github.com/rust-lang/rust/issues/100"));
|
||||
assert!(!matches("https://github.com/rust-lang"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_pr_extracts_owner_repo_number() {
|
||||
assert_eq!(
|
||||
parse_pr("https://github.com/rust-lang/rust/pull/12345"),
|
||||
Some(("rust-lang".into(), "rust".into(), 12345))
|
||||
);
|
||||
assert_eq!(
|
||||
parse_pr("https://github.com/rust-lang/rust/pull/12345/files"),
|
||||
Some(("rust-lang".into(), "rust".into(), 12345))
|
||||
);
|
||||
}
|
||||
}
|
||||
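The URL-parsing contract exercised by the tests above can also be run standalone; the sketch below is an independent port of `parse_pr` (no webclaw types involved), included only to illustrate the segment-based parsing, not the crate's public API.

```rust
// Standalone port of the parse_pr logic: take the path after the host,
// strip query/fragment and trailing slash, then expect
// /{owner}/{repo}/pull/{number} with extra segments tolerated.
fn parse_pr(url: &str) -> Option<(String, String, u64)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 4 || (segs[2] != "pull" && segs[2] != "pulls") {
        return None;
    }
    let number: u64 = segs[3].parse().ok()?;
    Some((segs[0].to_string(), segs[1].to_string(), number))
}

fn main() {
    // Trailing sub-pages like /files are tolerated: only segs[0..4] matter.
    assert_eq!(
        parse_pr("https://github.com/rust-lang/rust/pull/12345/files"),
        Some(("rust-lang".to_string(), "rust".to_string(), 12345))
    );
    // Query strings never leak into the number because '?' is split off first.
    assert_eq!(
        parse_pr("https://github.com/rust-lang/rust/pull/7?w=1"),
        Some(("rust-lang".to_string(), "rust".to_string(), 7))
    );
    assert_eq!(parse_pr("https://github.com/rust-lang/rust"), None);
}
```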
crates/webclaw-fetch/src/extractors/github_release.rs (new file, 179 lines)
@@ -0,0 +1,179 @@
//! GitHub release structured extractor.
//!
//! `api.github.com/repos/{owner}/{repo}/releases/tags/{tag}`. Returns
//! the release notes body, asset list with download counts, and
//! prerelease flag.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_release",
    label: "GitHub release",
    description: "Returns release metadata: tag, name, body (release notes), assets with download counts.",
    url_patterns: &["https://github.com/{owner}/{repo}/releases/tag/{tag}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    parse_release(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo, tag) = parse_release(url).ok_or_else(|| {
        FetchError::Build(format!("github_release: cannot parse release URL '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}/releases/tags/{tag}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_release: release '{owner}/{repo}@{tag}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_release: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour."
                .into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let r: Release = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github release parse: {e}")))?;

    let assets: Vec<Value> = r
        .assets
        .iter()
        .map(|a| {
            json!({
                "name": a.name,
                "size": a.size,
                "download_count": a.download_count,
                "browser_download_url": a.browser_download_url,
                "content_type": a.content_type,
                "created_at": a.created_at,
                "updated_at": a.updated_at,
            })
        })
        .collect();

    Ok(json!({
        "url": url,
        "owner": owner,
        "repo": repo,
        "tag_name": r.tag_name,
        "name": r.name,
        "body": r.body,
        "draft": r.draft,
        "prerelease": r.prerelease,
        "author": r.author.as_ref().and_then(|u| u.login.clone()),
        "created_at": r.created_at,
        "published_at": r.published_at,
        "asset_count": assets.len(),
        "total_downloads": r.assets.iter().map(|a| a.download_count.unwrap_or(0)).sum::<i64>(),
        "assets": assets,
        "html_url": r.html_url,
    }))
}

fn parse_release(url: &str) -> Option<(String, String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{repo}/releases/tag/{tag}
    if segs.len() < 5 {
        return None;
    }
    if segs[2] != "releases" || segs[3] != "tag" {
        return None;
    }
    Some((
        segs[0].to_string(),
        segs[1].to_string(),
        segs[4].to_string(),
    ))
}

// ---------------------------------------------------------------------------
// GitHub Release API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Release {
    tag_name: Option<String>,
    name: Option<String>,
    body: Option<String>,
    draft: Option<bool>,
    prerelease: Option<bool>,
    author: Option<UserRef>,
    created_at: Option<String>,
    published_at: Option<String>,
    html_url: Option<String>,
    #[serde(default)]
    assets: Vec<Asset>,
}

#[derive(Deserialize)]
struct UserRef {
    login: Option<String>,
}

#[derive(Deserialize)]
struct Asset {
    name: Option<String>,
    size: Option<i64>,
    download_count: Option<i64>,
    browser_download_url: Option<String>,
    content_type: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_release_urls() {
        assert!(matches(
            "https://github.com/rust-lang/rust/releases/tag/1.85.0"
        ));
        assert!(matches(
            "https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"
        ));
        assert!(!matches("https://github.com/rust-lang/rust"));
        assert!(!matches("https://github.com/rust-lang/rust/releases"));
        assert!(!matches("https://github.com/rust-lang/rust/pull/100"));
    }

    #[test]
    fn parse_release_extracts_owner_repo_tag() {
        assert_eq!(
            parse_release("https://github.com/0xMassi/webclaw/releases/tag/v0.4.0"),
            Some(("0xMassi".into(), "webclaw".into(), "v0.4.0".into()))
        );
        assert_eq!(
            parse_release("https://github.com/rust-lang/rust/releases/tag/1.85.0/?foo=bar"),
            Some(("rust-lang".into(), "rust".into(), "1.85.0".into()))
        );
    }
}
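The `total_downloads` field above sums per-asset counts where each count is optional in the API payload; a missing count is treated as zero rather than poisoning the whole sum. A toy, self-contained version of that aggregation:

```rust
// Mirror of the total_downloads fold: Option<i64> per asset,
// None defaults to 0 via unwrap_or before summing.
fn total(counts: &[Option<i64>]) -> i64 {
    counts.iter().map(|c| c.unwrap_or(0)).sum()
}

fn main() {
    assert_eq!(total(&[Some(10), None, Some(5)]), 15);
    // An empty asset list sums to 0, not an error.
    assert_eq!(total(&[]), 0);
}
```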
crates/webclaw-fetch/src/extractors/github_repo.rs (new file, 212 lines)
@@ -0,0 +1,212 @@
//! GitHub repository structured extractor.
//!
//! Uses GitHub's public REST API at `api.github.com/repos/{owner}/{repo}`.
//! Unauthenticated requests get 60/hour per IP, which is fine for users
//! self-hosting and for low-volume cloud usage. Production cloud should
//! set a `GITHUB_TOKEN` to lift to 5,000/hour, but the extractor doesn't
//! depend on it being set — it works open out of the box.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "github_repo",
    label: "GitHub repository",
    description: "Returns repo metadata: stars, forks, topics, license, default branch, recent activity.",
    url_patterns: &["https://github.com/{owner}/{repo}"],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host != "github.com" && host != "www.github.com" {
        return false;
    }
    // Path must be exactly /{owner}/{repo} (or with trailing slash). Reject
    // sub-pages (issues, pulls, blob, etc.) so we don't claim URLs the
    // future github_issue / github_pr extractors will handle.
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 2 && !RESERVED_OWNERS.contains(&segs[0])
}

/// GitHub uses some top-level paths for non-repo pages.
const RESERVED_OWNERS: &[&str] = &[
    "settings",
    "marketplace",
    "explore",
    "topics",
    "trending",
    "collections",
    "events",
    "sponsors",
    "issues",
    "pulls",
    "notifications",
    "new",
    "organizations",
    "login",
    "join",
    "search",
    "about",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, repo) = parse_owner_repo(url).ok_or_else(|| {
        FetchError::Build(format!("github_repo: cannot parse owner/repo from '{url}'"))
    })?;

    let api_url = format!("https://api.github.com/repos/{owner}/{repo}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "github_repo: repo '{owner}/{repo}' not found"
        )));
    }
    if resp.status == 403 {
        return Err(FetchError::Build(
            "github_repo: rate limited (60/hour unauth). Set GITHUB_TOKEN for 5,000/hour.".into(),
        ));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "github api returned status {}",
            resp.status
        )));
    }

    let r: Repo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("github api parse: {e}")))?;

    Ok(json!({
        "url": url,
        "owner": r.owner.as_ref().map(|o| &o.login),
        "name": r.name,
        "full_name": r.full_name,
        "description": r.description,
        "homepage": r.homepage,
        "language": r.language,
        "topics": r.topics,
        "license": r.license.as_ref().and_then(|l| l.spdx_id.clone()),
        "license_name": r.license.as_ref().map(|l| l.name.clone()),
        "default_branch": r.default_branch,
        "stars": r.stargazers_count,
        "forks": r.forks_count,
        "watchers": r.subscribers_count,
        "open_issues": r.open_issues_count,
        "size_kb": r.size,
        "archived": r.archived,
        "fork": r.fork,
        "is_template": r.is_template,
        "has_issues": r.has_issues,
        "has_wiki": r.has_wiki,
        "has_pages": r.has_pages,
        "has_discussions": r.has_discussions,
        "created_at": r.created_at,
        "updated_at": r.updated_at,
        "pushed_at": r.pushed_at,
        "html_url": r.html_url,
    }))
}

fn parse_owner_repo(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let owner = segs.next()?.to_string();
    let repo = segs.next()?.to_string();
    Some((owner, repo))
}

// ---------------------------------------------------------------------------
// GitHub API types — only the fields we surface
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Repo {
    name: Option<String>,
    full_name: Option<String>,
    description: Option<String>,
    homepage: Option<String>,
    language: Option<String>,
    #[serde(default)]
    topics: Vec<String>,
    license: Option<License>,
    default_branch: Option<String>,
    stargazers_count: Option<i64>,
    forks_count: Option<i64>,
    subscribers_count: Option<i64>,
    open_issues_count: Option<i64>,
    size: Option<i64>,
    archived: Option<bool>,
    fork: Option<bool>,
    is_template: Option<bool>,
    has_issues: Option<bool>,
    has_wiki: Option<bool>,
    has_pages: Option<bool>,
    has_discussions: Option<bool>,
    created_at: Option<String>,
    updated_at: Option<String>,
    pushed_at: Option<String>,
    html_url: Option<String>,
    owner: Option<Owner>,
}

#[derive(Deserialize)]
struct Owner {
    login: String,
}

#[derive(Deserialize)]
struct License {
    name: String,
    spdx_id: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_repo_root_only() {
        assert!(matches("https://github.com/rust-lang/rust"));
        assert!(matches("https://github.com/rust-lang/rust/"));
        assert!(!matches("https://github.com/rust-lang/rust/issues"));
        assert!(!matches("https://github.com/rust-lang/rust/pulls/123"));
        assert!(!matches("https://github.com/rust-lang"));
        assert!(!matches("https://github.com/marketplace"));
        assert!(!matches("https://github.com/topics/rust"));
        assert!(!matches("https://example.com/foo/bar"));
    }

    #[test]
    fn parse_owner_repo_handles_trailing_slash_and_query() {
        assert_eq!(
            parse_owner_repo("https://github.com/rust-lang/rust"),
            Some(("rust-lang".into(), "rust".into()))
        );
        assert_eq!(
            parse_owner_repo("https://github.com/rust-lang/rust/?tab=foo"),
            Some(("rust-lang".into(), "rust".into()))
        );
    }
}
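The repo-root matching rule above reduces to "exactly two path segments, and the first isn't a GitHub-internal namespace". A self-contained sketch of that predicate, with a trimmed-down reserved list used purely for illustration (the real extractor checks the full `RESERVED_OWNERS` list and also accepts `www.github.com`):

```rust
// Illustrative subset of the reserved-owner list; not the full set.
const RESERVED: &[&str] = &["marketplace", "topics", "settings"];

// Standalone version of the matches() shape: host check, strip
// query/fragment and trailing slash, then count non-empty segments.
fn is_repo_root(url: &str) -> bool {
    let path = match url.split("://").nth(1).and_then(|s| s.split_once('/')) {
        Some((host, p)) if host == "github.com" => p,
        _ => return false,
    };
    let stripped = path.split(['?', '#']).next().unwrap_or("").trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 2 && !RESERVED.contains(&segs[0])
}

fn main() {
    assert!(is_repo_root("https://github.com/rust-lang/rust/"));
    // Sub-pages have three segments, so they fall through to other extractors.
    assert!(!is_repo_root("https://github.com/rust-lang/rust/issues"));
    // Two segments, but a reserved namespace.
    assert!(!is_repo_root("https://github.com/topics/rust"));
}
```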
crates/webclaw-fetch/src/extractors/hackernews.rs (new file, 186 lines)
@@ -0,0 +1,186 @@
//! Hacker News structured extractor.
//!
//! Uses Algolia's HN API (`hn.algolia.com/api/v1/items/{id}`) which
//! returns the full post + recursive comment tree in a single request.
//! The official Firebase API at `hacker-news.firebaseio.com` requires
//! N+1 fetches per comment, so we'd hit either timeout or rate-limit
//! on any non-trivial thread.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "hackernews",
    label: "Hacker News story",
    description: "Returns post + nested comment tree for a Hacker News item.",
    url_patterns: &[
        "https://news.ycombinator.com/item?id=N",
        "https://hn.algolia.com/items/N",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = url
        .split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("");
    if host == "news.ycombinator.com" {
        return url.contains("item?id=") || url.contains("item%3Fid=");
    }
    if host == "hn.algolia.com" {
        return url.contains("/items/");
    }
    false
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_item_id(url).ok_or_else(|| {
        FetchError::Build(format!("hackernews: cannot parse item id from '{url}'"))
    })?;

    let api_url = format!("https://hn.algolia.com/api/v1/items/{id}");
    let resp = client.fetch(&api_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hn algolia returned status {}",
            resp.status
        )));
    }

    let item: AlgoliaItem = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hn algolia parse: {e}")))?;

    let post = post_json(&item);
    let comments: Vec<Value> = item.children.iter().filter_map(comment_json).collect();

    Ok(json!({
        "url": url,
        "post": post,
        "comments": comments,
    }))
}

// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------

/// Pull the numeric id out of a HN URL. Handles `item?id=N` and the
/// Algolia mirror's `/items/N` form.
fn parse_item_id(url: &str) -> Option<u64> {
    if let Some(after) = url.split("id=").nth(1) {
        let n = after.split('&').next().unwrap_or(after);
        if let Ok(id) = n.parse::<u64>() {
            return Some(id);
        }
    }
    if let Some(after) = url.split("/items/").nth(1) {
        let n = after.split(['/', '?', '#']).next().unwrap_or(after);
        if let Ok(id) = n.parse::<u64>() {
            return Some(id);
        }
    }
    None
}

fn post_json(item: &AlgoliaItem) -> Value {
    json!({
        "id": item.id,
        "type": item.r#type,
        "title": item.title,
        "url": item.url,
        "author": item.author,
        "points": item.points,
        "text": item.text, // populated for ask/show/tell
        "created_at": item.created_at,
        "created_at_unix": item.created_at_i,
        "comment_count": count_descendants(item),
        "permalink": item.id.map(|i| format!("https://news.ycombinator.com/item?id={i}")),
    })
}

fn comment_json(item: &AlgoliaItem) -> Option<Value> {
    if !matches!(item.r#type.as_deref(), Some("comment")) {
        return None;
    }
    // Dead/deleted comments still appear in the tree; surface them honestly.
    let replies: Vec<Value> = item.children.iter().filter_map(comment_json).collect();
    Some(json!({
        "id": item.id,
        "author": item.author,
        "text": item.text,
        "created_at": item.created_at,
        "created_at_unix": item.created_at_i,
        "parent_id": item.parent_id,
        "story_id": item.story_id,
        "replies": replies,
    }))
}

fn count_descendants(item: &AlgoliaItem) -> usize {
    item.children
        .iter()
        .filter(|c| matches!(c.r#type.as_deref(), Some("comment")))
        .map(|c| 1 + count_descendants(c))
        .sum()
}

// ---------------------------------------------------------------------------
// Algolia API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct AlgoliaItem {
    id: Option<u64>,
    r#type: Option<String>,
    title: Option<String>,
    url: Option<String>,
    author: Option<String>,
    points: Option<i64>,
    text: Option<String>,
    created_at: Option<String>,
    created_at_i: Option<i64>,
    parent_id: Option<u64>,
    story_id: Option<u64>,
    #[serde(default)]
    children: Vec<AlgoliaItem>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_hn_item_urls() {
        assert!(matches("https://news.ycombinator.com/item?id=1"));
        assert!(matches("https://news.ycombinator.com/item?id=12345"));
        assert!(matches("https://hn.algolia.com/items/1"));
    }

    #[test]
    fn rejects_non_item_urls() {
        assert!(!matches("https://news.ycombinator.com/"));
        assert!(!matches("https://news.ycombinator.com/news"));
        assert!(!matches("https://example.com/item?id=1"));
    }

    #[test]
    fn parse_item_id_handles_both_forms() {
        assert_eq!(
            parse_item_id("https://news.ycombinator.com/item?id=1"),
            Some(1)
        );
        assert_eq!(
            parse_item_id("https://news.ycombinator.com/item?id=12345&p=2"),
            Some(12345)
        );
        assert_eq!(parse_item_id("https://hn.algolia.com/items/999"), Some(999));
        assert_eq!(parse_item_id("https://example.com/foo"), None);
    }
}
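The `count_descendants` recursion above only counts children whose type is `"comment"` at each level, so polls, job entries, or other non-comment children never inflate `comment_count`. A toy, self-contained model of that traversal (the local `Node` type stands in for `AlgoliaItem`):

```rust
// Minimal stand-in for AlgoliaItem: a type flag plus nested children.
struct Node {
    is_comment: bool,
    children: Vec<Node>,
}

// Same shape as count_descendants: filter to comments, then
// count each comment as 1 plus everything beneath it.
fn count(n: &Node) -> usize {
    n.children
        .iter()
        .filter(|c| c.is_comment)
        .map(|c| 1 + count(c))
        .sum()
}

fn main() {
    // A story with two top-level comments, one of which has one reply.
    let story = Node {
        is_comment: false,
        children: vec![
            Node {
                is_comment: true,
                children: vec![Node { is_comment: true, children: vec![] }],
            },
            Node { is_comment: true, children: vec![] },
        ],
    };
    assert_eq!(count(&story), 3);
}
```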
crates/webclaw-fetch/src/extractors/huggingface_dataset.rs (new file, 189 lines)
@@ -0,0 +1,189 @@
//! HuggingFace dataset structured extractor.
//!
//! Same shape as the model extractor but hits the dataset endpoint.
//! `huggingface.co/api/datasets/{owner}/{name}`.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "huggingface_dataset",
    label: "HuggingFace dataset",
    description: "Returns dataset metadata: downloads, likes, license, language, task categories, file list.",
    url_patterns: &["https://huggingface.co/datasets/{owner}/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "huggingface.co" && host != "www.huggingface.co" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /datasets/{name} (legacy top-level) or /datasets/{owner}/{name} (canonical).
    segs.first().copied() == Some("datasets") && (segs.len() == 2 || segs.len() == 3)
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let dataset_path = parse_dataset_path(url).ok_or_else(|| {
        FetchError::Build(format!(
            "hf_dataset: cannot parse dataset path from '{url}'"
        ))
    })?;

    let api_url = format!("https://huggingface.co/api/datasets/{dataset_path}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "hf_dataset: '{dataset_path}' not found"
        )));
    }
    if resp.status == 401 {
        return Err(FetchError::Build(format!(
            "hf_dataset: '{dataset_path}' requires authentication (gated)"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hf_dataset api returned status {}",
            resp.status
        )));
    }

    let d: DatasetInfo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hf_dataset parse: {e}")))?;

    let files: Vec<Value> = d
        .siblings
        .iter()
        .map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
        .collect();

    Ok(json!({
        "url": url,
        "id": d.id,
        "private": d.private,
        "gated": d.gated,
        "downloads": d.downloads,
        "downloads_all_time": d.downloads_all_time,
        "likes": d.likes,
        "tags": d.tags,
        "license": d.card_data.as_ref().and_then(|c| c.license.clone()),
        "language": d.card_data.as_ref().and_then(|c| c.language.clone()),
        "task_categories": d.card_data.as_ref().and_then(|c| c.task_categories.clone()),
        "size_categories": d.card_data.as_ref().and_then(|c| c.size_categories.clone()),
        "annotations_creators": d.card_data.as_ref().and_then(|c| c.annotations_creators.clone()),
        "configs": d.card_data.as_ref().and_then(|c| c.configs.clone()),
        "created_at": d.created_at,
        "last_modified": d.last_modified,
        "sha": d.sha,
        "file_count": d.siblings.len(),
        "files": files,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Returns the part to append to the API URL — either `name` (legacy
/// top-level dataset like `squad`) or `owner/name` (canonical form).
fn parse_dataset_path(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    if segs.next() != Some("datasets") {
        return None;
    }
    let first = segs.next()?.to_string();
    match segs.next() {
        Some(second) => Some(format!("{first}/{second}")),
        None => Some(first),
    }
}

#[derive(Deserialize)]
struct DatasetInfo {
    id: Option<String>,
    private: Option<bool>,
    gated: Option<serde_json::Value>,
    downloads: Option<i64>,
    #[serde(rename = "downloadsAllTime")]
    downloads_all_time: Option<i64>,
    likes: Option<i64>,
    #[serde(default)]
    tags: Vec<String>,
    #[serde(rename = "createdAt")]
    created_at: Option<String>,
    #[serde(rename = "lastModified")]
    last_modified: Option<String>,
    sha: Option<String>,
    #[serde(rename = "cardData")]
    card_data: Option<DatasetCard>,
    #[serde(default)]
    siblings: Vec<Sibling>,
}

#[derive(Deserialize)]
struct DatasetCard {
    license: Option<serde_json::Value>,
    language: Option<serde_json::Value>,
    task_categories: Option<serde_json::Value>,
    size_categories: Option<serde_json::Value>,
    annotations_creators: Option<serde_json::Value>,
    configs: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Sibling {
    rfilename: String,
    size: Option<i64>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_dataset_pages() {
        assert!(matches("https://huggingface.co/datasets/squad")); // legacy top-level
        assert!(matches("https://huggingface.co/datasets/openai/gsm8k")); // canonical owner/name
        assert!(!matches("https://huggingface.co/openai/whisper-large-v3"));
        assert!(!matches("https://huggingface.co/datasets/"));
    }

    #[test]
    fn parse_dataset_path_works() {
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/squad"),
            Some("squad".into())
        );
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k"),
            Some("openai/gsm8k".into())
        );
        assert_eq!(
            parse_dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers"),
            Some("openai/gsm8k".into())
        );
    }
}
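The legacy-vs-canonical dataset path rule above (`/datasets/squad` maps to `squad`, `/datasets/openai/gsm8k` to `openai/gsm8k`) can be exercised standalone; this is an independent port of `parse_dataset_path`, shown only to make the two URL forms concrete:

```rust
// Standalone port of parse_dataset_path: after the "datasets" segment,
// one segment is a legacy top-level name, two segments are owner/name.
fn dataset_path(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    if segs.next() != Some("datasets") {
        return None;
    }
    let first = segs.next()?.to_string();
    match segs.next() {
        Some(second) => Some(format!("{first}/{second}")),
        None => Some(first),
    }
}

fn main() {
    assert_eq!(
        dataset_path("https://huggingface.co/datasets/squad").as_deref(),
        Some("squad")
    );
    // Query strings and trailing slashes are stripped before splitting.
    assert_eq!(
        dataset_path("https://huggingface.co/datasets/openai/gsm8k/?lib=transformers").as_deref(),
        Some("openai/gsm8k")
    );
    // Model pages never start with the "datasets" segment.
    assert_eq!(dataset_path("https://huggingface.co/openai/whisper-large-v3"), None);
}
```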
crates/webclaw-fetch/src/extractors/huggingface_model.rs (new file, 223 lines)
@@ -0,0 +1,223 @@
//! HuggingFace model card structured extractor.
//!
//! Uses the public model API at `huggingface.co/api/models/{owner}/{name}`.
//! Returns metadata + the parsed model card front matter, but does not
//! pull the full README body — those are sometimes 100KB+ and the user
//! can hit /v1/scrape if they want it as markdown.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "huggingface_model",
    label: "HuggingFace model",
    description: "Returns model metadata: downloads, likes, license, pipeline tag, library name, file list.",
    url_patterns: &["https://huggingface.co/{owner}/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "huggingface.co" && host != "www.huggingface.co" {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    // /{owner}/{name} but reject HF-internal sections + sub-pages.
    if segs.len() != 2 {
        return false;
    }
    !RESERVED_NAMESPACES.contains(&segs[0])
}

const RESERVED_NAMESPACES: &[&str] = &[
    "datasets",
    "spaces",
    "blog",
    "docs",
    "api",
    "models",
    "papers",
    "pricing",
    "tasks",
    "join",
    "login",
    "settings",
    "organizations",
    "new",
    "search",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (owner, name) = parse_owner_name(url).ok_or_else(|| {
        FetchError::Build(format!("hf model: cannot parse owner/name from '{url}'"))
    })?;

    let api_url = format!("https://huggingface.co/api/models/{owner}/{name}");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "hf model: '{owner}/{name}' not found"
        )));
    }
    if resp.status == 401 {
        return Err(FetchError::Build(format!(
            "hf model: '{owner}/{name}' requires authentication (gated repo)"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "hf api returned status {}",
            resp.status
        )));
    }

    let m: ModelInfo = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("hf api parse: {e}")))?;

    // Surface a flat file list — full siblings can be hundreds of entries
    // for big repos. We keep it as-is because callers want to know about
    // every shard; if it bloats responses too much we'll add pagination.
    let files: Vec<Value> = m
        .siblings
        .iter()
        .map(|s| json!({"rfilename": s.rfilename, "size": s.size}))
        .collect();

    Ok(json!({
        "url": url,
        "id": m.id,
        "model_id": m.model_id,
        "private": m.private,
        "gated": m.gated,
        "downloads": m.downloads,
        "downloads_all_time": m.downloads_all_time,
        "likes": m.likes,
        "library_name": m.library_name,
        "pipeline_tag": m.pipeline_tag,
        "tags": m.tags,
        "license": m.card_data.as_ref().and_then(|c| c.license.clone()),
        "language": m.card_data.as_ref().and_then(|c| c.language.clone()),
        "datasets": m.card_data.as_ref().and_then(|c| c.datasets.clone()),
        "base_model": m.card_data.as_ref().and_then(|c| c.base_model.clone()),
        "model_type": m.card_data.as_ref().and_then(|c| c.model_type.clone()),
        "created_at": m.created_at,
        "last_modified": m.last_modified,
        "sha": m.sha,
        "file_count": m.siblings.len(),
        "files": files,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_owner_name(url: &str) -> Option<(String, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let owner = segs.next()?.to_string();
|
||||
let name = segs.next()?.to_string();
|
||||
Some((owner, name))
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// HF API types
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct ModelInfo {
|
||||
id: Option<String>,
|
||||
#[serde(rename = "modelId")]
|
||||
model_id: Option<String>,
|
||||
private: Option<bool>,
|
||||
gated: Option<serde_json::Value>, // bool or string ("auto" / "manual" / false)
|
||||
downloads: Option<i64>,
|
||||
#[serde(rename = "downloadsAllTime")]
|
||||
downloads_all_time: Option<i64>,
|
||||
likes: Option<i64>,
|
||||
#[serde(rename = "library_name")]
|
||||
library_name: Option<String>,
|
||||
#[serde(rename = "pipeline_tag")]
|
||||
pipeline_tag: Option<String>,
|
||||
#[serde(default)]
|
||||
tags: Vec<String>,
|
||||
#[serde(rename = "createdAt")]
|
||||
created_at: Option<String>,
|
||||
#[serde(rename = "lastModified")]
|
||||
last_modified: Option<String>,
|
||||
sha: Option<String>,
|
||||
#[serde(rename = "cardData")]
|
||||
card_data: Option<CardData>,
|
||||
#[serde(default)]
|
||||
siblings: Vec<Sibling>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct CardData {
|
||||
license: Option<serde_json::Value>, // string or array
|
||||
language: Option<serde_json::Value>,
|
||||
datasets: Option<serde_json::Value>,
|
||||
#[serde(rename = "base_model")]
|
||||
base_model: Option<serde_json::Value>,
|
||||
#[serde(rename = "model_type")]
|
||||
model_type: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Deserialize)]
|
||||
struct Sibling {
|
||||
rfilename: String,
|
||||
size: Option<i64>,
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn matches_model_pages() {
|
||||
assert!(matches("https://huggingface.co/meta-llama/Meta-Llama-3-8B"));
|
||||
assert!(matches("https://huggingface.co/openai/whisper-large-v3"));
|
||||
assert!(matches("https://huggingface.co/bert-base-uncased/main")); // owner=bert-base-uncased name=main: false positive but acceptable for v1
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn rejects_hf_section_pages() {
|
||||
assert!(!matches("https://huggingface.co/datasets/squad"));
|
||||
assert!(!matches("https://huggingface.co/spaces/foo/bar"));
|
||||
assert!(!matches("https://huggingface.co/blog/intro"));
|
||||
assert!(!matches("https://huggingface.co/"));
|
||||
assert!(!matches("https://huggingface.co/meta-llama"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn parse_owner_name_pulls_both() {
|
||||
assert_eq!(
|
||||
parse_owner_name("https://huggingface.co/meta-llama/Meta-Llama-3-8B"),
|
||||
Some(("meta-llama".into(), "Meta-Llama-3-8B".into()))
|
||||
);
|
||||
assert_eq!(
|
||||
parse_owner_name("https://huggingface.co/openai/whisper-large-v3?library=transformers"),
|
||||
Some(("openai".into(), "whisper-large-v3".into()))
|
||||
);
|
||||
}
|
||||
}
|
||||
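The page-URL to API-URL mapping the extractor performs is a pure string transform. A minimal standalone sketch (a hypothetical helper restating the logic, not part of the crate; it assumes the two-segment / reserved-namespace validation has already passed):

```rust
// Strip the scheme, drop query/fragment, take the first two path
// segments, and rebuild against huggingface.co/api/models/.
fn hf_api_url(page_url: &str) -> Option<String> {
    let path = page_url.split("://").nth(1)?.split_once('/')?.1;
    let clean = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = clean.split('/').filter(|s| !s.is_empty());
    let (owner, name) = (segs.next()?, segs.next()?);
    Some(format!("https://huggingface.co/api/models/{owner}/{name}"))
}

fn main() {
    assert_eq!(
        hf_api_url("https://huggingface.co/meta-llama/Meta-Llama-3-8B").as_deref(),
        Some("https://huggingface.co/api/models/meta-llama/Meta-Llama-3-8B")
    );
}
```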
235	crates/webclaw-fetch/src/extractors/instagram_post.rs	Normal file
@@ -0,0 +1,235 @@
//! Instagram post structured extractor.
//!
//! Uses Instagram's public embed endpoint
//! `/p/{shortcode}/embed/captioned/` which returns SSR HTML with the
//! full caption, author username, and thumbnail. No auth required.
//! The same endpoint serves reels and IGTV under `/reel/{code}` and
//! `/tv/{code}` URLs (we accept all three).

use regex::Regex;
use serde_json::{Value, json};
use std::sync::OnceLock;

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "instagram_post",
    label: "Instagram post",
    description: "Returns full caption, author username, thumbnail, and post type (post / reel / tv) via Instagram's public embed.",
    url_patterns: &[
        "https://www.instagram.com/p/{shortcode}/",
        "https://www.instagram.com/reel/{shortcode}/",
        "https://www.instagram.com/tv/{shortcode}/",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.instagram.com" | "instagram.com") {
        return false;
    }
    parse_shortcode(url).is_some()
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (kind, shortcode) = parse_shortcode(url).ok_or_else(|| {
        FetchError::Build(format!(
            "instagram_post: cannot parse shortcode from '{url}'"
        ))
    })?;

    // Instagram serves the same embed HTML for posts/reels/tv under /p/.
    let embed_url = format!("https://www.instagram.com/p/{shortcode}/embed/captioned/");
    let resp = client.fetch(&embed_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "instagram embed returned status {} for {shortcode}",
            resp.status
        )));
    }

    let html = &resp.html;
    let username = parse_username(html);
    let caption = parse_caption(html);
    let thumbnail = parse_thumbnail(html);

    Ok(json!({
        "url": url,
        "embed_url": embed_url,
        "shortcode": shortcode,
        "kind": kind,
        "data_completeness": "embed",
        "author_username": username,
        "caption": caption,
        "thumbnail_url": thumbnail,
        "canonical_url": format!("https://www.instagram.com/{}/{shortcode}/", path_segment_for(kind)),
    }))
}

// ---------------------------------------------------------------------------
// URL parsing
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Returns `(kind, shortcode)` where kind ∈ {`post`, `reel`, `tv`}.
fn parse_shortcode(url: &str) -> Option<(&'static str, String)> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let first = segs.next()?;
    let kind = match first {
        "p" => "post",
        "reel" | "reels" => "reel",
        "tv" => "tv",
        _ => return None,
    };
    let shortcode = segs.next()?;
    if shortcode.is_empty() {
        return None;
    }
    Some((kind, shortcode.to_string()))
}

fn path_segment_for(kind: &str) -> &'static str {
    match kind {
        "reel" => "reel",
        "tv" => "tv",
        _ => "p",
    }
}

// ---------------------------------------------------------------------------
// HTML scraping
// ---------------------------------------------------------------------------

/// Username appears as the anchor text inside `<a class="CaptionUsername">`.
fn parse_username(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r#"(?s)class="CaptionUsername"[^>]*>([^<]+)<"#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| html_decode(m.as_str().trim()))
}

/// Caption sits inside `<div class="Caption">` after the username anchor.
/// We grab the whole Caption block and strip out the username link, time
/// node, and any trailing "Photo by" / "View ... on Instagram" boilerplate.
fn parse_caption(html: &str) -> Option<String> {
    static RE_OUTER: OnceLock<Regex> = OnceLock::new();
    let outer = RE_OUTER
        .get_or_init(|| Regex::new(r#"(?s)<div\s+class="Caption"[^>]*>(.*?)</div>"#).unwrap());
    let block = outer.captures(html)?.get(1)?.as_str();

    // Strip everything wrapped in <a class="CaptionUsername">...</a>.
    static RE_USER: OnceLock<Regex> = OnceLock::new();
    let user_re = RE_USER
        .get_or_init(|| Regex::new(r#"(?s)<a[^>]*class="CaptionUsername"[^>]*>.*?</a>"#).unwrap());
    let stripped = user_re.replace_all(block, "");

    // Then strip any remaining tags.
    static RE_TAGS: OnceLock<Regex> = OnceLock::new();
    let tag_re = RE_TAGS.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
    let text = tag_re.replace_all(&stripped, " ");

    let cleaned = collapse_whitespace(&html_decode(text.trim()));
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

/// Thumbnail is the `<img class="EmbeddedMediaImage">` inside the embed
/// (or the og:image as fallback).
fn parse_thumbnail(html: &str) -> Option<String> {
    static RE_IMG: OnceLock<Regex> = OnceLock::new();
    let img_re = RE_IMG.get_or_init(|| {
        Regex::new(r#"(?s)<img[^>]+class="[^"]*EmbeddedMediaImage[^"]*"[^>]+src="([^"]+)""#)
            .unwrap()
    });
    if let Some(m) = img_re.captures(html).and_then(|c| c.get(1)) {
        return Some(html_decode(m.as_str()));
    }
    static RE_OG: OnceLock<Regex> = OnceLock::new();
    let og_re = RE_OG.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:image"[^>]+content="([^"]+)""#).unwrap()
    });
    og_re
        .captures(html)
        .and_then(|c| c.get(1))
        .map(|m| html_decode(m.as_str()))
}

fn html_decode(s: &str) -> String {
    s.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&#64;", "@")
        .replace("&#8226;", "•")
        .replace("&hellip;", "…")
}

fn collapse_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_post_reel_tv_urls() {
        assert!(matches("https://www.instagram.com/p/DT-RICMjeK5/"));
        assert!(matches(
            "https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"
        ));
        assert!(matches("https://www.instagram.com/reel/abc123/"));
        assert!(matches("https://www.instagram.com/tv/abc123/"));
        assert!(!matches("https://www.instagram.com/ticketswave"));
        assert!(!matches("https://www.instagram.com/"));
        assert!(!matches("https://example.com/p/abc/"));
    }

    #[test]
    fn parse_shortcode_reads_each_kind() {
        assert_eq!(
            parse_shortcode("https://www.instagram.com/p/DT-RICMjeK5/?img_index=1"),
            Some(("post", "DT-RICMjeK5".into()))
        );
        assert_eq!(
            parse_shortcode("https://www.instagram.com/reel/abc123/"),
            Some(("reel", "abc123".into()))
        );
        assert_eq!(
            parse_shortcode("https://www.instagram.com/tv/abc123"),
            Some(("tv", "abc123".into()))
        );
    }

    #[test]
    fn parse_username_pulls_anchor_text() {
        let html = r#"<a class="CaptionUsername" href="...">ticketswave</a>"#;
        assert_eq!(parse_username(html).as_deref(), Some("ticketswave"));
    }

    #[test]
    fn parse_caption_strips_username_anchor() {
        let html = r#"<div class="Caption"><a class="CaptionUsername" href="...">ticketswave</a> Some caption text here</div>"#;
        assert_eq!(
            parse_caption(html).as_deref(),
            Some("Some caption text here")
        );
    }
}
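The post/reel/tv funnel to the single captioned-embed endpoint can be restated as one pure function. A hypothetical standalone sketch (mirroring the `parse_shortcode` + `embed_url` steps above, not part of the crate):

```rust
// Any /p/, /reel(s)/, or /tv/ URL collapses to the same
// /p/{shortcode}/embed/captioned/ endpoint the extractor fetches.
fn ig_embed_url(post_url: &str) -> Option<String> {
    let path = post_url.split("://").nth(1)?.split_once('/')?.1;
    let clean = path.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = clean.split('/').filter(|s| !s.is_empty());
    match segs.next()? {
        "p" | "reel" | "reels" | "tv" => {}
        _ => return None,
    }
    let code = segs.next()?;
    Some(format!("https://www.instagram.com/p/{code}/embed/captioned/"))
}

fn main() {
    assert_eq!(
        ig_embed_url("https://www.instagram.com/reel/abc123/").as_deref(),
        Some("https://www.instagram.com/p/abc123/embed/captioned/")
    );
}
```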
465	crates/webclaw-fetch/src/extractors/instagram_profile.rs	Normal file
@@ -0,0 +1,465 @@
//! Instagram profile structured extractor.
//!
//! Hits Instagram's internal `web_profile_info` endpoint at
//! `instagram.com/api/v1/users/web_profile_info/?username=X`. The
//! `x-ig-app-id` header is Instagram's own public web-app id (not a
//! secret) — the same value Instagram's own JavaScript bundle sends.
//!
//! Returns the full profile (bio, exact follower count, verified /
//! business flags, profile picture) plus the **12 most recent posts**
//! with shortcodes, like counts, types, thumbnails, and caption
//! previews. Callers can fan out to `/v1/scrape/instagram_post` per
//! shortcode to get the full caption + media.
//!
//! Pagination beyond 12 requires authenticated cookies + a CSRF token;
//! we accept that as the practical ceiling for the unauth path. The
//! cloud (with stored sessions) can paginate later as a follow-up.
//!
//! Falls back to OG-tag scraping of the public profile page if the API
//! returns 401/403 — Instagram has tightened this endpoint multiple
//! times, so we keep the second path warm.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "instagram_profile",
    label: "Instagram profile",
    description: "Returns full profile metadata + the 12 most recent posts (shortcode, url, type, likes, thumbnail).",
    url_patterns: &["https://www.instagram.com/{username}/"],
};

/// Instagram's own public web-app identifier. Sent by their JS bundle
/// on every API call, accepted by the unauth endpoint, not a secret.
const IG_APP_ID: &str = "936619743392459";

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.instagram.com" | "instagram.com") {
        return false;
    }
    let path = url
        .split("://")
        .nth(1)
        .and_then(|s| s.split_once('/'))
        .map(|(_, p)| p)
        .unwrap_or("");
    let stripped = path
        .split(['?', '#'])
        .next()
        .unwrap_or("")
        .trim_end_matches('/');
    let segs: Vec<&str> = stripped.split('/').filter(|s| !s.is_empty()).collect();
    segs.len() == 1 && !RESERVED.contains(&segs[0])
}

const RESERVED: &[&str] = &[
    "p",
    "reel",
    "reels",
    "tv",
    "explore",
    "stories",
    "directory",
    "accounts",
    "about",
    "developer",
    "press",
    "api",
    "ads",
    "blog",
    "fragments",
    "terms",
    "privacy",
    "session",
    "login",
    "signup",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let username = parse_username(url).ok_or_else(|| {
        FetchError::Build(format!(
            "instagram_profile: cannot parse username from '{url}'"
        ))
    })?;

    let api_url =
        format!("https://www.instagram.com/api/v1/users/web_profile_info/?username={username}");
    let extra_headers: &[(&str, &str)] = &[
        ("x-ig-app-id", IG_APP_ID),
        ("accept", "*/*"),
        ("sec-fetch-site", "same-origin"),
        ("x-requested-with", "XMLHttpRequest"),
    ];
    let resp = client.fetch_with_headers(&api_url, extra_headers).await?;

    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "instagram_profile: '{username}' not found"
        )));
    }
    // Auth wall fallback: Instagram occasionally tightens this endpoint
    // and starts returning 401/403/302 to a login page. When that
    // happens we still want to give the caller something useful — the
    // OG tags from the public HTML page (no posts list, but bio etc).
    if !(200..300).contains(&resp.status) {
        return og_fallback(client, &username, url, resp.status).await;
    }

    let body: ApiResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("instagram_profile parse: {e}")))?;
    let user = body.data.user;

    let recent_posts: Vec<Value> = user
        .edge_owner_to_timeline_media
        .as_ref()
        .map(|m| m.edges.iter().map(|e| post_summary(&e.node)).collect())
        .unwrap_or_default();

    Ok(json!({
        "url": url,
        "canonical_url": format!("https://www.instagram.com/{username}/"),
        "username": user.username.unwrap_or(username),
        "data_completeness": "api",
        "user_id": user.id,
        "full_name": user.full_name,
        "biography": user.biography,
        "biography_links": user.bio_links,
        "external_url": user.external_url,
        "category": user.category_name,
        "follower_count": user.edge_followed_by.map(|c| c.count),
        "following_count": user.edge_follow.map(|c| c.count),
        "post_count": user.edge_owner_to_timeline_media.as_ref().map(|m| m.count),
        "is_verified": user.is_verified,
        "is_private": user.is_private,
        "is_business": user.is_business_account,
        "is_professional": user.is_professional_account,
        "profile_pic_url": user.profile_pic_url_hd.or(user.profile_pic_url),
        "recent_posts": recent_posts,
    }))
}

/// Build the per-post summary the caller fans out from. Includes a
/// constructed `url` so the loop is `for p in recent_posts: scrape('instagram_post', p.url)`.
fn post_summary(n: &MediaNode) -> Value {
    let kind = classify(n);
    let url = match kind {
        "reel" => format!(
            "https://www.instagram.com/reel/{}/",
            n.shortcode.as_deref().unwrap_or("")
        ),
        _ => format!(
            "https://www.instagram.com/p/{}/",
            n.shortcode.as_deref().unwrap_or("")
        ),
    };
    let caption = n
        .edge_media_to_caption
        .as_ref()
        .and_then(|c| c.edges.first())
        .and_then(|e| e.node.text.clone());
    json!({
        "shortcode": n.shortcode,
        "url": url,
        "kind": kind,
        "is_video": n.is_video.unwrap_or(false),
        "video_views": n.video_view_count,
        "thumbnail_url": n.thumbnail_src.clone().or_else(|| n.display_url.clone()),
        "display_url": n.display_url,
        "like_count": n.edge_media_preview_like.as_ref().map(|c| c.count),
        "comment_count": n.edge_media_to_comment.as_ref().map(|c| c.count),
        "taken_at": n.taken_at_timestamp,
        "caption": caption,
        "alt_text": n.accessibility_caption,
        "dimensions": n.dimensions.as_ref().map(|d| json!({"width": d.width, "height": d.height})),
        "product_type": n.product_type,
    })
}

/// Best-effort post-type classification. `clips` is reels; `feed` is
/// the regular grid. Sidecar = multi-photo carousel.
fn classify(n: &MediaNode) -> &'static str {
    if n.product_type.as_deref() == Some("clips") {
        return "reel";
    }
    match n.typename.as_deref() {
        Some("GraphSidecar") => "carousel",
        Some("GraphVideo") => "video",
        Some("GraphImage") => "photo",
        _ => "post",
    }
}

/// Fallback when the API path is blocked: hit the public profile HTML,
/// pull whatever OG tags we can. Returns less data and explicitly
/// flags `data_completeness: "og_only"` so callers know.
async fn og_fallback(
    client: &dyn Fetcher,
    username: &str,
    original_url: &str,
    api_status: u16,
) -> Result<Value, FetchError> {
    let canonical = format!("https://www.instagram.com/{username}/");
    let resp = client.fetch(&canonical).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "instagram_profile: api status {api_status}, html status {} for {username}",
            resp.status
        )));
    }
    let og = parse_og_tags(&resp.html);
    let (followers, following, posts) =
        parse_counts_from_og_description(og.get("description").map(String::as_str));

    Ok(json!({
        "url": original_url,
        "canonical_url": canonical,
        "username": username,
        "data_completeness": "og_only",
        "fallback_reason": format!("api returned {api_status}"),
        "full_name": parse_full_name(&og.get("title").cloned().unwrap_or_default()),
        "follower_count": followers,
        "following_count": following,
        "post_count": posts,
        "profile_pic_url": og.get("image").cloned(),
        "biography": null_value(),
        "is_verified": null_value(),
        "is_business": null_value(),
        "recent_posts": Vec::<Value>::new(),
    }))
}

fn null_value() -> Value {
    Value::Null
}

// ---------------------------------------------------------------------------
// URL parsing
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_username(url: &str) -> Option<String> {
    let path = url.split("://").nth(1)?.split_once('/').map(|(_, p)| p)?;
    let stripped = path.split(['?', '#']).next()?.trim_end_matches('/');
    stripped
        .split('/')
        .find(|s| !s.is_empty())
        .map(|s| s.to_string())
}

// ---------------------------------------------------------------------------
// OG-fallback helpers (kept self-contained — same shape as the previous
// version we shipped, retained as the safety net)
// ---------------------------------------------------------------------------

fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
    use regex::Regex;
    use std::sync::OnceLock;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    let mut out = std::collections::HashMap::new();
    for c in re.captures_iter(html) {
        let k = c
            .get(1)
            .map(|m| m.as_str().to_lowercase())
            .unwrap_or_default();
        let v = c
            .get(2)
            .map(|m| html_decode(m.as_str()))
            .unwrap_or_default();
        out.entry(k).or_insert(v);
    }
    out
}

fn parse_full_name(og_title: &str) -> Option<String> {
    if og_title.is_empty() {
        return None;
    }
    let decoded = html_decode(og_title);
    let trimmed = decoded.split('(').next().unwrap_or(&decoded).trim();
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed.to_string())
    }
}

fn parse_counts_from_og_description(desc: Option<&str>) -> (Option<i64>, Option<i64>, Option<i64>) {
    let Some(text) = desc else {
        return (None, None, None);
    };
    let decoded = html_decode(text);
    use regex::Regex;
    use std::sync::OnceLock;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r"(?i)([\d.,]+[KMB]?)\s*Followers,\s*([\d.,]+[KMB]?)\s*Following,\s*([\d.,]+[KMB]?)\s*Posts").unwrap()
    });
    if let Some(c) = re.captures(&decoded) {
        return (
            c.get(1).and_then(|m| parse_compact_number(m.as_str())),
            c.get(2).and_then(|m| parse_compact_number(m.as_str())),
            c.get(3).and_then(|m| parse_compact_number(m.as_str())),
        );
    }
    (None, None, None)
}

fn parse_compact_number(s: &str) -> Option<i64> {
    let s = s.trim();
    let (num_str, mul) = match s.chars().last() {
        Some('K') => (&s[..s.len() - 1], 1_000i64),
        Some('M') => (&s[..s.len() - 1], 1_000_000i64),
        Some('B') => (&s[..s.len() - 1], 1_000_000_000i64),
        _ => (s, 1i64),
    };
    let cleaned: String = num_str.chars().filter(|c| *c != ',').collect();
    cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64)
}

fn html_decode(s: &str) -> String {
    s.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&#64;", "@")
        .replace("&#8226;", "•")
        .replace("&hellip;", "…")
}

// ---------------------------------------------------------------------------
// Instagram web_profile_info API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct ApiResponse {
    data: ApiData,
}

#[derive(Deserialize)]
struct ApiData {
    user: User,
}

#[derive(Deserialize)]
struct User {
    id: Option<String>,
    username: Option<String>,
    full_name: Option<String>,
    biography: Option<String>,
    bio_links: Option<Vec<serde_json::Value>>,
    external_url: Option<String>,
    category_name: Option<String>,
    profile_pic_url: Option<String>,
    profile_pic_url_hd: Option<String>,
    is_verified: Option<bool>,
    is_private: Option<bool>,
    is_business_account: Option<bool>,
    is_professional_account: Option<bool>,
    edge_followed_by: Option<EdgeCount>,
    edge_follow: Option<EdgeCount>,
    edge_owner_to_timeline_media: Option<MediaEdges>,
}

#[derive(Deserialize)]
struct EdgeCount {
    count: i64,
}

#[derive(Deserialize)]
struct MediaEdges {
    count: i64,
    edges: Vec<MediaEdge>,
}

#[derive(Deserialize)]
struct MediaEdge {
    node: MediaNode,
}

#[derive(Deserialize)]
struct MediaNode {
    #[serde(rename = "__typename")]
    typename: Option<String>,
    shortcode: Option<String>,
    is_video: Option<bool>,
    video_view_count: Option<i64>,
    display_url: Option<String>,
    thumbnail_src: Option<String>,
    accessibility_caption: Option<String>,
    taken_at_timestamp: Option<i64>,
    product_type: Option<String>,
    dimensions: Option<Dimensions>,
    edge_media_preview_like: Option<EdgeCount>,
    edge_media_to_comment: Option<EdgeCount>,
    edge_media_to_caption: Option<CaptionEdges>,
}

#[derive(Deserialize)]
struct Dimensions {
    width: i64,
    height: i64,
}

#[derive(Deserialize)]
struct CaptionEdges {
    edges: Vec<CaptionEdge>,
}

#[derive(Deserialize)]
struct CaptionEdge {
    node: CaptionNode,
}

#[derive(Deserialize)]
struct CaptionNode {
    text: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_profile_urls() {
        assert!(matches("https://www.instagram.com/ticketswave"));
        assert!(matches("https://www.instagram.com/ticketswave/"));
        assert!(matches("https://instagram.com/0xmassi/?hl=en"));
        assert!(!matches("https://www.instagram.com/p/DT-RICMjeK5/"));
        assert!(!matches("https://www.instagram.com/explore"));
        assert!(!matches("https://www.instagram.com/"));
        assert!(!matches("https://example.com/foo"));
    }

    #[test]
    fn parse_full_name_strips_handle() {
        assert_eq!(
            parse_full_name("Ticket Wave (@ticketswave) • Instagram photos and videos"),
            Some("Ticket Wave".into())
        );
    }

    #[test]
    fn compact_number_handles_kmb() {
        assert_eq!(parse_compact_number("18K"), Some(18_000));
        assert_eq!(parse_compact_number("1.5M"), Some(1_500_000));
        assert_eq!(parse_compact_number("1,234"), Some(1_234));
        assert_eq!(parse_compact_number("641"), Some(641));
    }
}
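The K/M/B suffix handling the OG fallback relies on reduces to a short pure function. A standalone restatement of that logic (hypothetical helper mirroring `parse_compact_number` above, stdlib only; the suffixes are ASCII, so byte slicing is safe):

```rust
// "18K" → 18_000, "1.5M" → 1_500_000, "1,234" → 1_234.
// Fractional K/M/B values round toward zero via the f64 cast.
fn compact_to_i64(s: &str) -> Option<i64> {
    let s = s.trim();
    let (digits, mul) = match s.chars().last()? {
        'K' => (&s[..s.len() - 1], 1_000_i64),
        'M' => (&s[..s.len() - 1], 1_000_000),
        'B' => (&s[..s.len() - 1], 1_000_000_000),
        _ => (s, 1),
    };
    let cleaned: String = digits.chars().filter(|c| *c != ',').collect();
    cleaned.parse::<f64>().ok().map(|f| (f * mul as f64) as i64)
}

fn main() {
    assert_eq!(compact_to_i64("18K"), Some(18_000));
    assert_eq!(compact_to_i64("1.5M"), Some(1_500_000));
    assert_eq!(compact_to_i64("1,234"), Some(1_234));
}
```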
266	crates/webclaw-fetch/src/extractors/linkedin_post.rs	Normal file
@@ -0,0 +1,266 @@
//! LinkedIn post structured extractor.
//!
//! Uses the public embed endpoint `/embed/feed/update/{urn}` which
//! LinkedIn provides for sites that want to render a post inline. No
//! auth required, returns SSR HTML with the full post body, OG tags,
//! image, and a link back to the original post.
//!
//! Accepts both URN forms (`urn:li:share:N` and `urn:li:activity:N`)
//! and pretty post URLs (`/posts/{user}_{slug}-{id}-{suffix}`) by
//! pulling the trailing numeric id and converting to an activity URN.

use regex::Regex;
use serde_json::{Value, json};
use std::sync::OnceLock;

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "linkedin_post",
    label: "LinkedIn post",
    description: "Returns post body, author name, image, and original URL via LinkedIn's public embed endpoint.",
    url_patterns: &[
        "https://www.linkedin.com/feed/update/urn:li:share:{id}",
        "https://www.linkedin.com/feed/update/urn:li:activity:{id}",
        "https://www.linkedin.com/posts/{user}_{slug}-{id}-{suffix}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.linkedin.com" | "linkedin.com") {
        return false;
    }
    url.contains("/feed/update/urn:li:") || url.contains("/posts/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let urn = extract_urn(url).ok_or_else(|| {
        FetchError::Build(format!(
            "linkedin_post: cannot extract URN from '{url}' (expected /feed/update/urn:li:... or /posts/{{slug}}-{{id}})"
        ))
    })?;

    let embed_url = format!("https://www.linkedin.com/embed/feed/update/{urn}");
    let resp = client.fetch(&embed_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "linkedin embed returned status {} for {urn}",
            resp.status
        )));
    }

    let html = &resp.html;
    let og = parse_og_tags(html);
    let body = parse_post_body(html);
    let author = parse_author(html);
    let canonical_url = og.get("url").cloned().unwrap_or_else(|| embed_url.clone());

    Ok(json!({
        "url": url,
        "embed_url": embed_url,
        "urn": urn,
        "canonical_url": canonical_url,
        "data_completeness": "embed",
        "title": og.get("title").cloned(),
        "body": body,
        "author_name": author,
        "image_url": og.get("image").cloned(),
        "site_name": og.get("site_name").cloned().unwrap_or_else(|| "LinkedIn".into()),
    }))
}

// ---------------------------------------------------------------------------
// URN extraction
// ---------------------------------------------------------------------------

/// Pull a `urn:li:share:N` or `urn:li:activity:N` from any LinkedIn URL.
/// `/posts/{slug}-{id}-{suffix}` URLs encode the activity id as the second-
/// to-last `-` separated chunk. Both forms map to a URN we can hit the
/// embed endpoint with.
fn extract_urn(url: &str) -> Option<String> {
    if let Some(idx) = url.find("urn:li:") {
        let tail = &url[idx..];
        let end = tail.find(['/', '?', '#']).unwrap_or(tail.len());
        let urn = &tail[..end];
        // Validate shape: urn:li:{type}:{digits}
        let mut parts = urn.split(':');
        if parts.next() == Some("urn")
            && parts.next() == Some("li")
            && parts.next().is_some()
            && parts
                .next()
                .filter(|p| p.chars().all(|c| c.is_ascii_digit()))
                .is_some()
        {
            return Some(urn.to_string());
        }
    }

    // /posts/{user}_{slug}-{19-digit-id}-{4-char-hash}/ — id is the second-
    // to-last segment after the last `-`.
    if url.contains("/posts/") {
        static RE: OnceLock<Regex> = OnceLock::new();
        let re =
            RE.get_or_init(|| Regex::new(r"/posts/[^/]*?-(\d{15,})-[A-Za-z0-9]{2,}/?").unwrap());
        if let Some(c) = re.captures(url)
            && let Some(id) = c.get(1)
        {
            return Some(format!("urn:li:activity:{}", id.as_str()));
        }
    }
    None
}

// ---------------------------------------------------------------------------
// HTML scraping
// ---------------------------------------------------------------------------

/// Pull `og:foo` → value pairs out of `<meta property="og:..." content="...">`.
/// Returns lowercased keys with leading `og:` stripped.
fn parse_og_tags(html: &str) -> std::collections::HashMap<String, String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    let mut out = std::collections::HashMap::new();
    for c in re.captures_iter(html) {
        let k = c
            .get(1)
            .map(|m| m.as_str().to_lowercase())
            .unwrap_or_default();
        let v = c
            .get(2)
            .map(|m| html_decode(m.as_str()))
            .unwrap_or_default();
        out.entry(k).or_insert(v);
    }
    out
}

/// Extract the post body text from the embed page. LinkedIn renders it
/// inside `<p class="attributed-text-segment-list__content ...">{text}</p>`
/// where the inner content can include nested `<a>` tags for links.
fn parse_post_body(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(
            r#"(?s)<p[^>]+class="[^"]*attributed-text-segment-list__content[^"]*"[^>]*>(.*?)</p>"#,
        )
        .unwrap()
    });
    let inner = re.captures(html).and_then(|c| c.get(1))?.as_str();
    Some(strip_tags(inner).trim().to_string())
}

/// Author name lives in the `<title>` like:
/// "55 founding members are in… | Orc Dev"
/// The chunk after the final `|` is the author display name. Falls back
/// to the og:title minus the post body if there's no title.
fn parse_author(html: &str) -> Option<String> {
    static RE_TITLE: OnceLock<Regex> = OnceLock::new();
    let re = RE_TITLE.get_or_init(|| Regex::new(r"<title>([^<]+)</title>").unwrap());
    let title = re.captures(html).and_then(|c| c.get(1))?.as_str();
    title
        .rsplit_once('|')
        .map(|(_, name)| html_decode(name.trim()))
}

/// Replace the small set of HTML entities LinkedIn (and Instagram, etc.)
/// stuff into OG content attributes.
fn html_decode(s: &str) -> String {
    s.replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&#64;", "@")
        .replace("&#8226;", "•")
        .replace("&#8230;", "…")
}

/// Crude HTML tag stripper for the post body. Preserves text inside
/// nested anchors so URLs don't disappear, and collapses runs of
/// whitespace introduced by line wrapping.
fn strip_tags(html: &str) -> String {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"<[^>]+>").unwrap());
    let no_tags = re.replace_all(html, "").to_string();
    html_decode(&no_tags)
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_li_post_urls() {
        assert!(matches(
            "https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"
        ));
        assert!(matches(
            "https://www.linkedin.com/feed/update/urn:li:activity:7452618583290892288"
        ));
        assert!(matches(
            "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c"
        ));
        assert!(!matches("https://www.linkedin.com/in/foo"));
        assert!(!matches("https://www.linkedin.com/"));
        assert!(!matches("https://example.com/feed/update/urn:li:share:1"));
    }

    #[test]
    fn extract_urn_from_share_url() {
        assert_eq!(
            extract_urn("https://www.linkedin.com/feed/update/urn:li:share:7452618582213144577/"),
            Some("urn:li:share:7452618582213144577".into())
        );
    }

    #[test]
    fn extract_urn_from_pretty_post_url() {
        assert_eq!(
            extract_urn(
                "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/"
            ),
            Some("urn:li:activity:7452618583290892288".into())
        );
    }

    #[test]
    fn parse_og_tags_basic() {
        let html = r#"<meta property="og:image" content="https://x.com/a.png">
            <meta property="og:url" content="https://example.com/x">"#;
        let og = parse_og_tags(html);
        assert_eq!(
            og.get("image").map(String::as_str),
            Some("https://x.com/a.png")
        );
        assert_eq!(
            og.get("url").map(String::as_str),
            Some("https://example.com/x")
        );
    }

    #[test]
    fn parse_post_body_strips_anchor_tags() {
        let html = r#"<p class="attributed-text-segment-list__content text-color-text" dir="ltr">Hello <a href="x">link</a> world</p>"#;
        assert_eq!(parse_post_body(html).as_deref(), Some("Hello link world"));
    }

    #[test]
    fn html_decode_handles_common_entities() {
        assert_eq!(html_decode("AT&amp;T &#64;jane"), "AT&T @jane");
    }
}
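The `/posts/{slug}-{id}-{suffix}` id recovery that `extract_urn` implements with a regex can also be sketched with plain string operations, which makes the "second-to-last `-` separated chunk" rule easy to see. This is an illustrative standalone sketch, not the extractor's actual code path, and the URL below is the made-up example from the tests:

```rust
// Sketch of the /posts/ activity-id recovery using only the standard
// library. The real extractor uses a regex with the same shape rule:
// {slug}-{15+ digit id}-{alphanumeric suffix}.
fn activity_urn_from_posts_url(url: &str) -> Option<String> {
    let tail = url.split("/posts/").nth(1)?.trim_end_matches('/');
    // rsplitn(3, '-') yields [suffix, id, rest-of-slug] from the end.
    let parts: Vec<&str> = tail.rsplitn(3, '-').collect();
    if parts.len() < 3 {
        return None;
    }
    let id = parts[1];
    // Activity ids are long runs of digits; reject anything else.
    if id.len() >= 15 && id.chars().all(|c| c.is_ascii_digit()) {
        Some(format!("urn:li:activity:{id}"))
    } else {
        None
    }
}

fn main() {
    let urn = activity_urn_from_posts_url(
        "https://www.linkedin.com/posts/somebody_some-slug-7452618583290892288-aB1c/",
    );
    // prints Some("urn:li:activity:7452618583290892288")
    println!("{urn:?}");
}
```

Splitting from the right is what makes this robust: the slug itself may contain `-`, but the suffix and id are always the last two chunks.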
502 crates/webclaw-fetch/src/extractors/mod.rs Normal file
@@ -0,0 +1,502 @@
//! Vertical extractors: site-specific parsers that return typed JSON
//! instead of generic markdown.
//!
//! Each extractor handles a single site or platform and exposes:
//! - `matches(url)` to claim ownership of a URL pattern
//! - `extract(client, url)` to fetch + parse into a typed JSON `Value`
//! - `INFO` static for the catalog (`/v1/extractors`)
//!
//! The dispatch in this module is a simple `match`-style chain rather than
//! a trait registry. With ~30 extractors that's still fast and avoids the
//! ceremony of dynamic dispatch. If we hit 50+ we'll revisit.
//!
//! Extractors prefer official JSON APIs over HTML scraping where one
//! exists (Reddit, HN/Algolia, PyPI, npm, GitHub, HuggingFace all have
//! one). HTML extraction is the fallback for sites that don't.

pub mod amazon_product;
pub mod arxiv;
pub mod crates_io;
pub mod dev_to;
pub mod docker_hub;
pub mod ebay_listing;
pub mod ecommerce_product;
pub mod etsy_listing;
pub mod github_issue;
pub mod github_pr;
pub mod github_release;
pub mod github_repo;
pub mod hackernews;
pub mod huggingface_dataset;
pub mod huggingface_model;
pub mod instagram_post;
pub mod instagram_profile;
pub mod linkedin_post;
pub mod npm;
pub mod pypi;
pub mod reddit;
pub mod shopify_collection;
pub mod shopify_product;
pub mod stackoverflow;
pub mod substack_post;
pub mod trustpilot_reviews;
pub mod woocommerce_product;
pub mod youtube_video;

use serde::Serialize;
use serde_json::Value;

use crate::error::FetchError;
use crate::fetcher::Fetcher;

/// Public catalog entry for `/v1/extractors`. Stable shape — clients
/// rely on `name` to pick the right `/v1/scrape/{name}` route.
#[derive(Debug, Clone, Serialize)]
pub struct ExtractorInfo {
    /// URL-safe identifier (`reddit`, `hackernews`, `github_repo`, ...).
    pub name: &'static str,
    /// Human-friendly display name.
    pub label: &'static str,
    /// One-line description of what the extractor returns.
    pub description: &'static str,
    /// Glob-ish URL pattern(s) the extractor claims. For documentation;
    /// the actual matching is done by the extractor's `matches` fn.
    pub url_patterns: &'static [&'static str],
}

/// Full catalog. Order is stable; new entries append.
pub fn list() -> Vec<ExtractorInfo> {
    vec![
        reddit::INFO,
        hackernews::INFO,
        github_repo::INFO,
        github_pr::INFO,
        github_issue::INFO,
        github_release::INFO,
        pypi::INFO,
        npm::INFO,
        crates_io::INFO,
        huggingface_model::INFO,
        huggingface_dataset::INFO,
        arxiv::INFO,
        docker_hub::INFO,
        dev_to::INFO,
        stackoverflow::INFO,
        substack_post::INFO,
        youtube_video::INFO,
        linkedin_post::INFO,
        instagram_post::INFO,
        instagram_profile::INFO,
        shopify_product::INFO,
        shopify_collection::INFO,
        ecommerce_product::INFO,
        woocommerce_product::INFO,
        amazon_product::INFO,
        ebay_listing::INFO,
        etsy_listing::INFO,
        trustpilot_reviews::INFO,
    ]
}

/// Auto-detect mode: try every extractor's `matches`, return the first
/// one that claims the URL. Used by `/v1/scrape` when the caller doesn't
/// pick a vertical explicitly.
pub async fn dispatch_by_url(
    client: &dyn Fetcher,
    url: &str,
) -> Option<Result<(&'static str, Value), FetchError>> {
    if reddit::matches(url) {
        return Some(
            reddit::extract(client, url)
                .await
                .map(|v| (reddit::INFO.name, v)),
        );
    }
    if hackernews::matches(url) {
        return Some(
            hackernews::extract(client, url)
                .await
                .map(|v| (hackernews::INFO.name, v)),
        );
    }
    if github_repo::matches(url) {
        return Some(
            github_repo::extract(client, url)
                .await
                .map(|v| (github_repo::INFO.name, v)),
        );
    }
    if pypi::matches(url) {
        return Some(
            pypi::extract(client, url)
                .await
                .map(|v| (pypi::INFO.name, v)),
        );
    }
    if npm::matches(url) {
        return Some(npm::extract(client, url).await.map(|v| (npm::INFO.name, v)));
    }
    if github_pr::matches(url) {
        return Some(
            github_pr::extract(client, url)
                .await
                .map(|v| (github_pr::INFO.name, v)),
        );
    }
    if github_issue::matches(url) {
        return Some(
            github_issue::extract(client, url)
                .await
                .map(|v| (github_issue::INFO.name, v)),
        );
    }
    if github_release::matches(url) {
        return Some(
            github_release::extract(client, url)
                .await
                .map(|v| (github_release::INFO.name, v)),
        );
    }
    if crates_io::matches(url) {
        return Some(
            crates_io::extract(client, url)
                .await
                .map(|v| (crates_io::INFO.name, v)),
        );
    }
    if huggingface_model::matches(url) {
        return Some(
            huggingface_model::extract(client, url)
                .await
                .map(|v| (huggingface_model::INFO.name, v)),
        );
    }
    if huggingface_dataset::matches(url) {
        return Some(
            huggingface_dataset::extract(client, url)
                .await
                .map(|v| (huggingface_dataset::INFO.name, v)),
        );
    }
    if arxiv::matches(url) {
        return Some(
            arxiv::extract(client, url)
                .await
                .map(|v| (arxiv::INFO.name, v)),
        );
    }
    if docker_hub::matches(url) {
        return Some(
            docker_hub::extract(client, url)
                .await
                .map(|v| (docker_hub::INFO.name, v)),
        );
    }
    if dev_to::matches(url) {
        return Some(
            dev_to::extract(client, url)
                .await
                .map(|v| (dev_to::INFO.name, v)),
        );
    }
    if stackoverflow::matches(url) {
        return Some(
            stackoverflow::extract(client, url)
                .await
                .map(|v| (stackoverflow::INFO.name, v)),
        );
    }
    if linkedin_post::matches(url) {
        return Some(
            linkedin_post::extract(client, url)
                .await
                .map(|v| (linkedin_post::INFO.name, v)),
        );
    }
    if instagram_post::matches(url) {
        return Some(
            instagram_post::extract(client, url)
                .await
                .map(|v| (instagram_post::INFO.name, v)),
        );
    }
    if instagram_profile::matches(url) {
        return Some(
            instagram_profile::extract(client, url)
                .await
                .map(|v| (instagram_profile::INFO.name, v)),
        );
    }
    // Antibot-gated verticals with unique hosts: safe to auto-dispatch
    // because the matcher can't confuse the URL for anything else. The
    // extractor's smart_fetch_html path handles the blocked-without-
    // API-key case with a clear actionable error.
    if amazon_product::matches(url) {
        return Some(
            amazon_product::extract(client, url)
                .await
                .map(|v| (amazon_product::INFO.name, v)),
        );
    }
    if ebay_listing::matches(url) {
        return Some(
            ebay_listing::extract(client, url)
                .await
                .map(|v| (ebay_listing::INFO.name, v)),
        );
    }
    if etsy_listing::matches(url) {
        return Some(
            etsy_listing::extract(client, url)
                .await
                .map(|v| (etsy_listing::INFO.name, v)),
        );
    }
    if trustpilot_reviews::matches(url) {
        return Some(
            trustpilot_reviews::extract(client, url)
                .await
                .map(|v| (trustpilot_reviews::INFO.name, v)),
        );
    }
    if youtube_video::matches(url) {
        return Some(
            youtube_video::extract(client, url)
                .await
                .map(|v| (youtube_video::INFO.name, v)),
        );
    }
    // NOTE: shopify_product, shopify_collection, ecommerce_product,
    // woocommerce_product, and substack_post are intentionally NOT
    // in auto-dispatch. Their `matches()` functions are permissive
    // (any URL with `/products/`, `/product/`, `/p/`, etc.) and
    // claiming those generically would steal URLs from the default
    // `/v1/scrape` markdown flow. Callers opt in via
    // `/v1/scrape/shopify_product` or `/v1/scrape/ecommerce_product`.
    None
}

/// Explicit mode: caller picked the vertical (`POST /v1/scrape/reddit`).
/// We still validate that the URL plausibly belongs to that vertical so
/// users get a clear "wrong route" error instead of a confusing parse
/// failure deep in the extractor.
pub async fn dispatch_by_name(
    client: &dyn Fetcher,
    name: &str,
    url: &str,
) -> Result<Value, ExtractorDispatchError> {
    match name {
        n if n == reddit::INFO.name => {
            run_or_mismatch(reddit::matches(url), n, url, || {
                reddit::extract(client, url)
            })
            .await
        }
        n if n == hackernews::INFO.name => {
            run_or_mismatch(hackernews::matches(url), n, url, || {
                hackernews::extract(client, url)
            })
            .await
        }
        n if n == github_repo::INFO.name => {
            run_or_mismatch(github_repo::matches(url), n, url, || {
                github_repo::extract(client, url)
            })
            .await
        }
        n if n == pypi::INFO.name => {
            run_or_mismatch(pypi::matches(url), n, url, || pypi::extract(client, url)).await
        }
        n if n == npm::INFO.name => {
            run_or_mismatch(npm::matches(url), n, url, || npm::extract(client, url)).await
        }
        n if n == github_pr::INFO.name => {
            run_or_mismatch(github_pr::matches(url), n, url, || {
                github_pr::extract(client, url)
            })
            .await
        }
        n if n == github_issue::INFO.name => {
            run_or_mismatch(github_issue::matches(url), n, url, || {
                github_issue::extract(client, url)
            })
            .await
        }
        n if n == github_release::INFO.name => {
            run_or_mismatch(github_release::matches(url), n, url, || {
                github_release::extract(client, url)
            })
            .await
        }
        n if n == crates_io::INFO.name => {
            run_or_mismatch(crates_io::matches(url), n, url, || {
                crates_io::extract(client, url)
            })
            .await
        }
        n if n == huggingface_model::INFO.name => {
            run_or_mismatch(huggingface_model::matches(url), n, url, || {
                huggingface_model::extract(client, url)
            })
            .await
        }
        n if n == huggingface_dataset::INFO.name => {
            run_or_mismatch(huggingface_dataset::matches(url), n, url, || {
                huggingface_dataset::extract(client, url)
            })
            .await
        }
        n if n == arxiv::INFO.name => {
            run_or_mismatch(arxiv::matches(url), n, url, || arxiv::extract(client, url)).await
        }
        n if n == docker_hub::INFO.name => {
            run_or_mismatch(docker_hub::matches(url), n, url, || {
                docker_hub::extract(client, url)
            })
            .await
        }
        n if n == dev_to::INFO.name => {
            run_or_mismatch(dev_to::matches(url), n, url, || {
                dev_to::extract(client, url)
            })
            .await
        }
        n if n == stackoverflow::INFO.name => {
            run_or_mismatch(stackoverflow::matches(url), n, url, || {
                stackoverflow::extract(client, url)
            })
            .await
        }
        n if n == linkedin_post::INFO.name => {
            run_or_mismatch(linkedin_post::matches(url), n, url, || {
                linkedin_post::extract(client, url)
            })
            .await
        }
        n if n == instagram_post::INFO.name => {
            run_or_mismatch(instagram_post::matches(url), n, url, || {
                instagram_post::extract(client, url)
            })
            .await
        }
        n if n == instagram_profile::INFO.name => {
            run_or_mismatch(instagram_profile::matches(url), n, url, || {
                instagram_profile::extract(client, url)
            })
            .await
        }
        n if n == shopify_product::INFO.name => {
            run_or_mismatch(shopify_product::matches(url), n, url, || {
                shopify_product::extract(client, url)
            })
            .await
        }
        n if n == ecommerce_product::INFO.name => {
            run_or_mismatch(ecommerce_product::matches(url), n, url, || {
                ecommerce_product::extract(client, url)
            })
            .await
        }
        n if n == amazon_product::INFO.name => {
            run_or_mismatch(amazon_product::matches(url), n, url, || {
                amazon_product::extract(client, url)
            })
            .await
        }
        n if n == ebay_listing::INFO.name => {
            run_or_mismatch(ebay_listing::matches(url), n, url, || {
                ebay_listing::extract(client, url)
            })
            .await
        }
        n if n == etsy_listing::INFO.name => {
            run_or_mismatch(etsy_listing::matches(url), n, url, || {
                etsy_listing::extract(client, url)
            })
            .await
        }
        n if n == trustpilot_reviews::INFO.name => {
            run_or_mismatch(trustpilot_reviews::matches(url), n, url, || {
                trustpilot_reviews::extract(client, url)
            })
            .await
        }
        n if n == youtube_video::INFO.name => {
            run_or_mismatch(youtube_video::matches(url), n, url, || {
                youtube_video::extract(client, url)
            })
            .await
        }
        n if n == substack_post::INFO.name => {
            run_or_mismatch(substack_post::matches(url), n, url, || {
                substack_post::extract(client, url)
            })
            .await
        }
        n if n == shopify_collection::INFO.name => {
            run_or_mismatch(shopify_collection::matches(url), n, url, || {
                shopify_collection::extract(client, url)
            })
            .await
        }
        n if n == woocommerce_product::INFO.name => {
            run_or_mismatch(woocommerce_product::matches(url), n, url, || {
                woocommerce_product::extract(client, url)
            })
            .await
        }
        _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
    }
}

/// Errors that the dispatcher itself raises (vs. errors from inside an
/// extractor, which come back wrapped in `Fetch`).
#[derive(Debug, thiserror::Error)]
pub enum ExtractorDispatchError {
    #[error("unknown vertical: '{0}'")]
    UnknownVertical(String),

    #[error("URL '{url}' does not match the '{vertical}' extractor")]
    UrlMismatch { vertical: String, url: String },

    #[error(transparent)]
    Fetch(#[from] FetchError),
}

/// Helper: when the caller explicitly picked a vertical but their URL
/// doesn't match it, return `UrlMismatch` instead of running the
/// extractor (which would just fail with a less-clear error).
async fn run_or_mismatch<F, Fut>(
    matches: bool,
    vertical: &str,
    url: &str,
    f: F,
) -> Result<Value, ExtractorDispatchError>
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = Result<Value, FetchError>>,
{
    if !matches {
        return Err(ExtractorDispatchError::UrlMismatch {
            vertical: vertical.to_string(),
            url: url.to_string(),
        });
    }
    f().await.map_err(ExtractorDispatchError::Fetch)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn list_is_non_empty_and_unique() {
        let entries = list();
        assert!(!entries.is_empty());
        let mut names: Vec<_> = entries.iter().map(|e| e.name).collect();
        names.sort();
        let before = names.len();
        names.dedup();
        assert_eq!(before, names.len(), "extractor names must be unique");
    }
}
235 crates/webclaw-fetch/src/extractors/npm.rs Normal file
@@ -0,0 +1,235 @@
//! npm package structured extractor.
//!
//! Uses two npm-run APIs:
//! - `registry.npmjs.org/{name}` for full package metadata
//! - `api.npmjs.org/downloads/point/last-week/{name}` for usage signal
//!
//! The registry API returns the *full* document including every version
//! ever published, which can be tens of MB for popular packages
//! (`@types/node` etc). We strip down to the latest version's manifest
//! and a count of releases — full history would explode the response.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "npm",
    label: "npm package",
    description: "Returns package metadata: latest version manifest, dependencies, weekly downloads, license.",
    url_patterns: &["https://www.npmjs.com/package/{name}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "www.npmjs.com" && host != "npmjs.com" {
        return false;
    }
    url.contains("/package/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let name = parse_name(url)
        .ok_or_else(|| FetchError::Build(format!("npm: cannot parse name from '{url}'")))?;

    let registry_url = format!("https://registry.npmjs.org/{}", urlencode_segment(&name));
    let resp = client.fetch(&registry_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "npm: package '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "npm registry returned status {}",
            resp.status
        )));
    }

    let pkg: PackageDoc = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("npm registry parse: {e}")))?;

    // Resolve "latest" to a concrete version.
    let latest_version = pkg
        .dist_tags
        .as_ref()
        .and_then(|t| t.get("latest"))
        .cloned()
        .or_else(|| pkg.versions.as_ref().and_then(|v| v.keys().last().cloned()));

    let latest_manifest = latest_version
        .as_deref()
        .and_then(|v| pkg.versions.as_ref().and_then(|m| m.get(v)));

    let release_count = pkg.versions.as_ref().map(|v| v.len()).unwrap_or(0);
    let latest_release_date = latest_version
        .as_deref()
        .and_then(|v| pkg.time.as_ref().and_then(|t| t.get(v).cloned()));

    // Best-effort weekly downloads. If the api.npmjs.org call fails we
    // surface `null` rather than failing the whole extractor — npm
    // sometimes 503s the downloads endpoint while the registry is up.
    let weekly_downloads = fetch_weekly_downloads(client, &name).await.ok();

    Ok(json!({
        "url": url,
        "name": pkg.name.clone().unwrap_or(name.clone()),
        "description": pkg.description,
        "latest_version": latest_version,
        "license": latest_manifest.and_then(|m| m.license.clone()),
        "homepage": pkg.homepage,
        "repository": pkg.repository.as_ref().and_then(|r| r.url.clone()),
        "dependencies": latest_manifest.and_then(|m| m.dependencies.clone()),
        "dev_dependencies": latest_manifest.and_then(|m| m.dev_dependencies.clone()),
        "peer_dependencies": latest_manifest.and_then(|m| m.peer_dependencies.clone()),
        "keywords": pkg.keywords,
        "maintainers": pkg.maintainers,
        "deprecated": latest_manifest.and_then(|m| m.deprecated.clone()),
        "release_count": release_count,
        "latest_release_date": latest_release_date,
        "weekly_downloads": weekly_downloads,
    }))
}

async fn fetch_weekly_downloads(client: &dyn Fetcher, name: &str) -> Result<i64, FetchError> {
    let url = format!(
        "https://api.npmjs.org/downloads/point/last-week/{}",
        urlencode_segment(name)
    );
    let resp = client.fetch(&url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "npm downloads api status {}",
            resp.status
        )));
    }
    let dl: Downloads = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("npm downloads parse: {e}")))?;
    Ok(dl.downloads)
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Extract the package name from an npmjs.com URL. Handles scoped packages
/// (`/package/@scope/name`) and trailing path segments (`/v/x.y.z`).
fn parse_name(url: &str) -> Option<String> {
    let after = url.split("/package/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let first = segs.next()?;
    if first.starts_with('@') {
        let second = segs.next()?;
        Some(format!("{first}/{second}"))
    } else {
        Some(first.to_string())
    }
}

/// `@scope/name` must encode the `/` for the registry path. Plain names
/// pass through untouched.
fn urlencode_segment(name: &str) -> String {
    name.replace('/', "%2F")
}

// ---------------------------------------------------------------------------
// Registry types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PackageDoc {
    name: Option<String>,
    description: Option<String>,
    homepage: Option<serde_json::Value>, // sometimes string, sometimes object
    repository: Option<Repository>,
    keywords: Option<Vec<String>>,
    maintainers: Option<Vec<Maintainer>>,
    #[serde(rename = "dist-tags")]
    dist_tags: Option<std::collections::BTreeMap<String, String>>,
    versions: Option<std::collections::BTreeMap<String, VersionManifest>>,
    time: Option<std::collections::BTreeMap<String, String>>,
}

#[derive(Deserialize, Default, Clone)]
struct VersionManifest {
    license: Option<serde_json::Value>, // string or object
    dependencies: Option<std::collections::BTreeMap<String, String>>,
    #[serde(rename = "devDependencies")]
    dev_dependencies: Option<std::collections::BTreeMap<String, String>>,
    #[serde(rename = "peerDependencies")]
    peer_dependencies: Option<std::collections::BTreeMap<String, String>>,
    // `deprecated` is sometimes a bool and sometimes a string in the
    // registry. serde_json::Value covers both without failing the parse.
    deprecated: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Repository {
    url: Option<String>,
}

#[derive(Deserialize, Clone)]
struct Maintainer {
    name: Option<String>,
    email: Option<String>,
}

impl serde::Serialize for Maintainer {
    fn serialize<S: serde::Serializer>(&self, s: S) -> Result<S::Ok, S::Error> {
        use serde::ser::SerializeMap;
        let mut m = s.serialize_map(Some(2))?;
        m.serialize_entry("name", &self.name)?;
        m.serialize_entry("email", &self.email)?;
        m.end()
    }
}

#[derive(Deserialize)]
struct Downloads {
    downloads: i64,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_npm_package_urls() {
        assert!(matches("https://www.npmjs.com/package/react"));
        assert!(matches("https://www.npmjs.com/package/@types/node"));
        assert!(matches("https://npmjs.com/package/lodash"));
        assert!(!matches("https://www.npmjs.com/"));
        assert!(!matches("https://example.com/package/foo"));
    }

    #[test]
    fn parse_name_handles_scoped_and_unscoped() {
        assert_eq!(
            parse_name("https://www.npmjs.com/package/react"),
            Some("react".into())
        );
        assert_eq!(
            parse_name("https://www.npmjs.com/package/@types/node"),
            Some("@types/node".into())
        );
        assert_eq!(
            parse_name("https://www.npmjs.com/package/lodash/v/4.17.21"),
            Some("lodash".into())
        );
    }

    #[test]
    fn urlencode_only_touches_scope_separator() {
        assert_eq!(urlencode_segment("react"), "react");
        assert_eq!(urlencode_segment("@types/node"), "@types%2Fnode");
    }
}
184 crates/webclaw-fetch/src/extractors/pypi.rs Normal file
@@ -0,0 +1,184 @@
//! PyPI package structured extractor.
//!
//! PyPI exposes a stable JSON API at `pypi.org/pypi/{name}/json` and
//! a versioned form at `pypi.org/pypi/{name}/{version}/json`. Both
//! return the full release info plus history. No auth, no rate limits
//! that we hit at normal usage.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "pypi",
    label: "PyPI package",
    description: "Returns package metadata: latest version, dependencies, license, release history.",
    url_patterns: &[
        "https://pypi.org/project/{name}/",
        "https://pypi.org/project/{name}/{version}/",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "pypi.org" && host != "www.pypi.org" {
        return false;
    }
    url.contains("/project/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (name, version) = parse_project(url).ok_or_else(|| {
        FetchError::Build(format!("pypi: cannot parse package name from '{url}'"))
    })?;

    let api_url = match &version {
        Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
        None => format!("https://pypi.org/pypi/{name}/json"),
    };
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "pypi: package '{name}' not found"
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "pypi api returned status {}",
            resp.status
        )));
    }

    let pkg: PypiResponse = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("pypi parse: {e}")))?;

    let info = pkg.info;
    let release_count = pkg.releases.as_ref().map(|r| r.len()).unwrap_or(0);

    // Latest release date = max upload time across files in the latest version.
    let latest_release_date = pkg
        .releases
        .as_ref()
        .and_then(|map| info.version.as_deref().and_then(|v| map.get(v)))
        .and_then(|files| files.iter().filter_map(|f| f.upload_time.clone()).max());

    // Drop the long description from the JSON shape — it's frequently a 50KB
    // README and bloats responses. Callers who need it can hit /v1/scrape.
    Ok(json!({
        "url": url,
        "name": info.name,
        "version": info.version,
        "summary": info.summary,
        "homepage": info.home_page,
        "license": info.license,
        "license_classifier": pick_license_classifier(&info.classifiers),
        "author": info.author,
        "author_email": info.author_email,
        "maintainer": info.maintainer,
        "requires_python": info.requires_python,
        "requires_dist": info.requires_dist,
        "keywords": info.keywords,
        "classifiers": info.classifiers,
        "yanked": info.yanked,
        "yanked_reason": info.yanked_reason,
        "project_urls": info.project_urls,
        "release_count": release_count,
        "latest_release_date": latest_release_date,
    }))
}

/// PyPI puts the SPDX-ish license under classifiers like
/// `License :: OSI Approved :: Apache Software License`. Surface the most
/// specific one when the `license` field itself is empty/junk.
fn pick_license_classifier(classifiers: &Option<Vec<String>>) -> Option<String> {
    classifiers
        .as_ref()?
        .iter()
        .filter(|c| c.starts_with("License ::"))
        .max_by_key(|c| c.len())
        .cloned()
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_project(url: &str) -> Option<(String, Option<String>)> {
    let after = url.split("/project/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let mut segs = stripped.split('/').filter(|s| !s.is_empty());
    let name = segs.next()?.to_string();
    let version = segs.next().map(|v| v.to_string());
    Some((name, version))
}

// ---------------------------------------------------------------------------
// PyPI API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct PypiResponse {
    info: Info,
    releases: Option<std::collections::BTreeMap<String, Vec<File>>>,
}

#[derive(Deserialize)]
struct Info {
    name: Option<String>,
    version: Option<String>,
    summary: Option<String>,
    home_page: Option<String>,
    license: Option<String>,
    author: Option<String>,
    author_email: Option<String>,
    maintainer: Option<String>,
    requires_python: Option<String>,
    requires_dist: Option<Vec<String>>,
    keywords: Option<String>,
    classifiers: Option<Vec<String>>,
    yanked: Option<bool>,
    yanked_reason: Option<String>,
    project_urls: Option<std::collections::BTreeMap<String, String>>,
}

#[derive(Deserialize)]
struct File {
    upload_time: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_project_urls() {
        assert!(matches("https://pypi.org/project/requests/"));
        assert!(matches("https://pypi.org/project/numpy/1.26.0/"));
        assert!(!matches("https://pypi.org/"));
        assert!(!matches("https://example.com/project/foo"));
    }

    #[test]
    fn parse_project_pulls_name_and_version() {
        assert_eq!(
            parse_project("https://pypi.org/project/requests/"),
            Some(("requests".into(), None))
        );
        assert_eq!(
            parse_project("https://pypi.org/project/numpy/1.26.0/"),
            Some(("numpy".into(), Some("1.26.0".into())))
        );
        assert_eq!(
            parse_project("https://pypi.org/project/scikit-learn/?foo=bar"),
            Some(("scikit-learn".into(), None))
        );
    }
}
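For reference, the project-page to JSON-API mapping the extractor above relies on is simple enough to exercise standalone. This is an illustrative sketch, not crate code; `project_to_api` is a hypothetical helper name:

```rust
// Illustrative sketch of the pypi.org URL mapping used by the extractor:
//   /project/{name}/           -> /pypi/{name}/json
//   /project/{name}/{version}/ -> /pypi/{name}/{version}/json
fn project_to_api(name: &str, version: Option<&str>) -> String {
    match version {
        Some(v) => format!("https://pypi.org/pypi/{name}/{v}/json"),
        None => format!("https://pypi.org/pypi/{name}/json"),
    }
}

fn main() {
    println!("{}", project_to_api("requests", None));
    println!("{}", project_to_api("numpy", Some("1.26.0")));
}
```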
234 crates/webclaw-fetch/src/extractors/reddit.rs Normal file
@@ -0,0 +1,234 @@
//! Reddit structured extractor — returns the full post + comment tree
//! as typed JSON via Reddit's `.json` API.
//!
//! The same trick the markdown extractor in `crate::reddit` uses:
//! appending `.json` to any post URL returns the data the new SPA
//! frontend would load client-side. Zero antibot, zero JS rendering.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "reddit",
    label: "Reddit thread",
    description: "Returns post + nested comment tree with scores, authors, and timestamps.",
    url_patterns: &[
        "https://www.reddit.com/r/*/comments/*",
        "https://reddit.com/r/*/comments/*",
        "https://old.reddit.com/r/*/comments/*",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    let is_reddit_host = matches!(
        host,
        "reddit.com" | "www.reddit.com" | "old.reddit.com" | "np.reddit.com" | "new.reddit.com"
    );
    is_reddit_host && url.contains("/comments/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let json_url = build_json_url(url);
    let resp = client.fetch(&json_url).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "reddit api returned status {}",
            resp.status
        )));
    }

    let listings: Vec<Listing> = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("reddit json parse: {e}")))?;

    if listings.is_empty() {
        return Err(FetchError::BodyDecode("reddit response empty".into()));
    }

    // First listing = the post (single t3 child).
    let post = listings
        .first()
        .and_then(|l| l.data.children.first())
        .filter(|t| t.kind == "t3")
        .map(|t| post_json(&t.data))
        .unwrap_or(Value::Null);

    // Second listing = the comment tree.
    let comments: Vec<Value> = listings
        .get(1)
        .map(|l| l.data.children.iter().filter_map(comment_json).collect())
        .unwrap_or_default();

    Ok(json!({
        "url": url,
        "post": post,
        "comments": comments,
    }))
}

// ---------------------------------------------------------------------------
// JSON shapers
// ---------------------------------------------------------------------------

fn post_json(d: &ThingData) -> Value {
    json!({
        "id": d.id,
        "title": d.title,
        "author": d.author,
        "subreddit": d.subreddit_name_prefixed,
        "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
        "url": d.url_overridden_by_dest,
        "is_self": d.is_self,
        "selftext": d.selftext,
        "score": d.score,
        "upvote_ratio": d.upvote_ratio,
        "num_comments": d.num_comments,
        "created_utc": d.created_utc,
        "link_flair_text": d.link_flair_text,
        "over_18": d.over_18,
        "spoiler": d.spoiler,
        "stickied": d.stickied,
        "locked": d.locked,
    })
}

/// Render a single comment + its reply tree. Returns `None` for non-t1
/// kinds (the trailing `more` placeholder Reddit injects at depth limits).
fn comment_json(thing: &Thing) -> Option<Value> {
    if thing.kind != "t1" {
        return None;
    }
    let d = &thing.data;
    let replies: Vec<Value> = match &d.replies {
        Some(Replies::Listing(l)) => l.data.children.iter().filter_map(comment_json).collect(),
        _ => Vec::new(),
    };
    Some(json!({
        "id": d.id,
        "author": d.author,
        "body": d.body,
        "score": d.score,
        "created_utc": d.created_utc,
        "is_submitter": d.is_submitter,
        "stickied": d.stickied,
        "depth": d.depth,
        "permalink": d.permalink.as_ref().map(|p| format!("https://www.reddit.com{p}")),
        "replies": replies,
    }))
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Build the Reddit JSON URL. We keep the original host (`www.reddit.com`
/// or `old.reddit.com` as the caller gave us). Routing through
/// `old.reddit.com` unconditionally looks appealing but that host has
/// stricter UA-based blocking than `www.reddit.com`, while the main
/// host accepts our Chrome-fingerprinted client fine.
fn build_json_url(url: &str) -> String {
    let clean = url.split('?').next().unwrap_or(url).trim_end_matches('/');
    format!("{clean}.json?raw_json=1")
}

// ---------------------------------------------------------------------------
// Reddit JSON types — only fields we render. Everything else is dropped.
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Listing {
    data: ListingData,
}

#[derive(Deserialize)]
struct ListingData {
    children: Vec<Thing>,
}

#[derive(Deserialize)]
struct Thing {
    kind: String,
    data: ThingData,
}

#[derive(Deserialize, Default)]
struct ThingData {
    // post (t3)
    id: Option<String>,
    title: Option<String>,
    selftext: Option<String>,
    subreddit_name_prefixed: Option<String>,
    url_overridden_by_dest: Option<String>,
    is_self: Option<bool>,
    upvote_ratio: Option<f64>,
    num_comments: Option<i64>,
    over_18: Option<bool>,
    spoiler: Option<bool>,
    stickied: Option<bool>,
    locked: Option<bool>,
    link_flair_text: Option<String>,

    // comment (t1)
    author: Option<String>,
    body: Option<String>,
    score: Option<i64>,
    created_utc: Option<f64>,
    is_submitter: Option<bool>,
    depth: Option<i64>,
    permalink: Option<String>,

    // recursive
    replies: Option<Replies>,
}

#[derive(Deserialize)]
#[serde(untagged)]
enum Replies {
    Listing(Listing),
    #[allow(dead_code)]
    Empty(String),
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_reddit_post_urls() {
        assert!(matches(
            "https://www.reddit.com/r/rust/comments/abc123/some_title/"
        ));
        assert!(matches(
            "https://reddit.com/r/rust/comments/abc123/some_title"
        ));
        assert!(matches("https://old.reddit.com/r/rust/comments/abc123/x/"));
    }

    #[test]
    fn rejects_non_post_reddit_urls() {
        assert!(!matches("https://www.reddit.com/r/rust"));
        assert!(!matches("https://www.reddit.com/user/foo"));
        assert!(!matches("https://example.com/r/rust/comments/x"));
    }

    #[test]
    fn json_url_appends_suffix_and_drops_query() {
        assert_eq!(
            build_json_url("https://www.reddit.com/r/rust/comments/abc/x/?utm=foo"),
            "https://www.reddit.com/r/rust/comments/abc/x.json?raw_json=1"
        );
    }
}
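The `replies` field the extractor above emits is recursive: each comment carries an array of the same shape. A minimal standalone sketch of walking that tree (the `Comment` type and `count_comments` helper here are illustrative, not the crate's types):

```rust
// Illustrative mirror of the extractor's output shape: every comment
// nests its replies as a vec of the same type, so consumers walk it
// recursively rather than flattening.
struct Comment {
    body: String,
    replies: Vec<Comment>,
}

// Total number of comments in a tree, counting every nesting level.
fn count_comments(comments: &[Comment]) -> usize {
    comments.iter().map(|c| 1 + count_comments(&c.replies)).sum()
}

fn main() {
    let tree = vec![
        Comment {
            body: "top-level".into(),
            replies: vec![Comment { body: "child".into(), replies: vec![] }],
        },
        Comment { body: "second top-level".into(), replies: vec![] },
    ];
    println!("{}", count_comments(&tree));
}
```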
242 crates/webclaw-fetch/src/extractors/shopify_collection.rs Normal file
@@ -0,0 +1,242 @@
//! Shopify collection structured extractor.
//!
//! Every Shopify store exposes `/collections/{handle}.json` and
//! `/collections/{handle}/products.json` on the public surface. This
//! extractor hits `.json` (collection metadata) and falls through to
//! `/products.json` for the first page of products. Same caveat as
//! `shopify_product`: stores with Cloudflare in front of the shop
//! will 403 the public path.
//!
//! Explicit-call only (like `shopify_product`). `/collections/{slug}`
//! is a URL shape used by non-Shopify stores too, so auto-dispatch
//! would claim too many URLs.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "shopify_collection",
    label: "Shopify collection",
    description: "Returns collection metadata + first page of products (handle, title, vendor, price, available) on ANY Shopify store via /collections/{handle}.json + /products.json.",
    url_patterns: &[
        "https://{shop}/collections/{handle}",
        "https://{shop}.myshopify.com/collections/{handle}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
        return false;
    }
    url.contains("/collections/") && !url.ends_with("/collections/")
}

const NON_SHOPIFY_HOSTS: &[&str] = &[
    "amazon.com",
    "amazon.co.uk",
    "amazon.de",
    "ebay.com",
    "etsy.com",
    "walmart.com",
    "target.com",
    "aliexpress.com",
    "huggingface.co", // has /collections/ for models
    "github.com",
];

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let (coll_meta_url, coll_products_url) = build_json_urls(url);

    // Step 1: collection metadata. Shopify sometimes returns 200 on
    // missing collections; check the "collection" key below.
    let meta_resp = client.fetch(&coll_meta_url).await?;
    if meta_resp.status == 404 {
        return Err(FetchError::Build(format!(
            "shopify_collection: '{url}' not found"
        )));
    }
    if meta_resp.status == 403 {
        return Err(FetchError::Build(format!(
            "shopify_collection: {coll_meta_url} returned 403. The store has antibot in front of the .json endpoint. Use /v1/scrape/ecommerce_product or api.webclaw.io for this store."
        )));
    }
    if meta_resp.status != 200 {
        return Err(FetchError::Build(format!(
            "shopify returned status {} for {coll_meta_url}",
            meta_resp.status
        )));
    }

    let meta: MetaWrapper = serde_json::from_str(&meta_resp.html).map_err(|e| {
        FetchError::BodyDecode(format!(
            "shopify_collection: '{url}' didn't return Shopify JSON, likely not a Shopify store ({e})"
        ))
    })?;

    // Step 2: first page of products for this collection.
    let products = match client.fetch(&coll_products_url).await {
        Ok(r) if r.status == 200 => serde_json::from_str::<ProductsWrapper>(&r.html)
            .ok()
            .map(|pw| pw.products)
            .unwrap_or_default(),
        _ => Vec::new(),
    };

    let product_summaries: Vec<Value> = products
        .iter()
        .map(|p| {
            let first_variant = p.variants.first();
            json!({
                "id": p.id,
                "handle": p.handle,
                "title": p.title,
                "vendor": p.vendor,
                "product_type": p.product_type,
                "price": first_variant.and_then(|v| v.price.clone()),
                "compare_at_price": first_variant.and_then(|v| v.compare_at_price.clone()),
                "available": p.variants.iter().any(|v| v.available.unwrap_or(false)),
                "variant_count": p.variants.len(),
                "image": p.images.first().and_then(|i| i.src.clone()),
                "created_at": p.created_at,
                "updated_at": p.updated_at,
            })
        })
        .collect();

    let c = meta.collection;
    Ok(json!({
        "url": url,
        "meta_json_url": coll_meta_url,
        "products_json_url": coll_products_url,
        "collection_id": c.id,
        "handle": c.handle,
        "title": c.title,
        "description_html": c.body_html,
        "published_at": c.published_at,
        "updated_at": c.updated_at,
        "sort_order": c.sort_order,
        "products_in_page": product_summaries.len(),
        "products": product_summaries,
    }))
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Build `(collection.json, collection/products.json)` from a user URL.
fn build_json_urls(url: &str) -> (String, String) {
    let (path_part, _query_part) = match url.split_once('?') {
        Some((a, b)) => (a, Some(b)),
        None => (url, None),
    };
    let clean = path_part.trim_end_matches('/').trim_end_matches(".json");
    (
        format!("{clean}.json"),
        format!("{clean}/products.json?limit=50"),
    )
}

// ---------------------------------------------------------------------------
// Shopify collection + product JSON shapes (subsets)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct MetaWrapper {
    collection: Collection,
}

#[derive(Deserialize)]
struct Collection {
    id: Option<i64>,
    handle: Option<String>,
    title: Option<String>,
    body_html: Option<String>,
    published_at: Option<String>,
    updated_at: Option<String>,
    sort_order: Option<String>,
}

#[derive(Deserialize)]
struct ProductsWrapper {
    #[serde(default)]
    products: Vec<ProductSummary>,
}

#[derive(Deserialize)]
struct ProductSummary {
    id: Option<i64>,
    handle: Option<String>,
    title: Option<String>,
    vendor: Option<String>,
    product_type: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
    #[serde(default)]
    variants: Vec<VariantSummary>,
    #[serde(default)]
    images: Vec<ImageSummary>,
}

#[derive(Deserialize)]
struct VariantSummary {
    price: Option<String>,
    compare_at_price: Option<String>,
    available: Option<bool>,
}

#[derive(Deserialize)]
struct ImageSummary {
    src: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_shopify_collection_urls() {
        assert!(matches("https://www.allbirds.com/collections/mens"));
        assert!(matches(
            "https://shop.example.com/collections/new-arrivals?page=2"
        ));
    }

    #[test]
    fn rejects_non_shopify() {
        assert!(!matches("https://github.com/collections/foo"));
        assert!(!matches("https://huggingface.co/collections/foo"));
        assert!(!matches("https://example.com/"));
        assert!(!matches("https://example.com/collections/"));
    }

    #[test]
    fn build_json_urls_derives_both_paths() {
        let (meta, products) = build_json_urls("https://shop.example.com/collections/mens");
        assert_eq!(meta, "https://shop.example.com/collections/mens.json");
        assert_eq!(
            products,
            "https://shop.example.com/collections/mens/products.json?limit=50"
        );
    }

    #[test]
    fn build_json_urls_handles_trailing_slash() {
        let (meta, _) = build_json_urls("https://shop.example.com/collections/mens/");
        assert_eq!(meta, "https://shop.example.com/collections/mens.json");
    }
}
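Both Shopify extractors summarize variant prices, which the storefront JSON serves as strings. A standalone sketch of that min/max summary, with `None` when no variant has a parseable price (`price_range` is an illustrative helper, not crate code):

```rust
// Illustrative: Shopify serves variant prices as strings ("29.99").
// Parse what parses, then summarize; `reduce` yields None on an empty
// iterator, which covers the "no parseable price" case cleanly.
fn price_range(prices: &[&str]) -> (Option<f64>, Option<f64>) {
    let parsed: Vec<f64> = prices.iter().filter_map(|s| s.parse().ok()).collect();
    (
        parsed.iter().copied().reduce(f64::min),
        parsed.iter().copied().reduce(f64::max),
    )
}

fn main() {
    println!("{:?}", price_range(&["29.99", "24.99", "n/a"]));
    println!("{:?}", price_range(&[]));
}
```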
318 crates/webclaw-fetch/src/extractors/shopify_product.rs Normal file
@@ -0,0 +1,318 @@
//! Shopify product structured extractor.
|
||||
//!
|
||||
//! Every Shopify store exposes a public JSON endpoint for each product
|
||||
//! by appending `.json` to the product URL:
|
||||
//!
|
||||
//! https://shop.example.com/products/cool-tshirt
|
||||
//! → https://shop.example.com/products/cool-tshirt.json
|
||||
//!
|
||||
//! There are ~4 million Shopify stores. The `.json` endpoint is
|
||||
//! undocumented but has been stable for 10+ years. When a store puts
|
||||
//! Cloudflare / antibot in front of the shop, this path can 403 just
|
||||
//! like any other — for those cases the caller should fall back to
|
||||
//! `ecommerce_product` (JSON-LD) or the cloud tier.
|
||||
//!
|
||||
//! This extractor is **explicit-call only** — it is NOT auto-dispatched
|
||||
//! from `/v1/scrape` because we cannot tell ahead of time whether an
|
||||
//! arbitrary `/products/{slug}` URL is a Shopify store. Callers hit
|
||||
//! `/v1/scrape/shopify_product` when they know.
|
||||
|
||||
use serde::Deserialize;
|
||||
use serde_json::{Value, json};
|
||||
|
||||
use super::ExtractorInfo;
|
||||
use crate::error::FetchError;
|
||||
use crate::fetcher::Fetcher;
|
||||
|
||||
pub const INFO: ExtractorInfo = ExtractorInfo {
|
||||
name: "shopify_product",
|
||||
label: "Shopify product",
|
||||
description: "Returns product metadata on ANY Shopify store via the public /products/{handle}.json endpoint: title, vendor, variants with prices + stock, images, options.",
|
||||
url_patterns: &[
|
||||
"https://{shop}/products/{handle}",
|
||||
"https://{shop}.myshopify.com/products/{handle}",
|
||||
],
|
||||
};
|
||||
|
||||
pub fn matches(url: &str) -> bool {
|
||||
// Any URL whose path contains /products/{something}. We do not
|
||||
// filter by host — Shopify powers custom-domain stores. The
|
||||
// extractor's /.json fallback is what confirms Shopify; `matches`
|
||||
// just says "this is a plausible shape." Still reject obviously
|
||||
// non-Shopify known hosts to save a failed request.
|
||||
let host = host_of(url);
|
||||
if host.is_empty() || NON_SHOPIFY_HOSTS.iter().any(|h| host.ends_with(h)) {
|
||||
return false;
|
||||
}
|
||||
url.contains("/products/") && !url.ends_with("/products/")
|
||||
}
|
||||
|
||||
/// Hosts we know are not Shopify — reject so we don't burn a request.
|
||||
const NON_SHOPIFY_HOSTS: &[&str] = &[
|
||||
"amazon.com",
|
||||
"amazon.co.uk",
|
||||
"amazon.de",
|
||||
"amazon.fr",
|
||||
"amazon.it",
|
||||
"ebay.com",
|
||||
"etsy.com",
|
||||
"walmart.com",
|
||||
"target.com",
|
||||
"aliexpress.com",
|
||||
"bestbuy.com",
|
||||
"wayfair.com",
|
||||
"homedepot.com",
|
||||
"github.com", // /products is a marketing page
|
||||
];
|
||||
|
||||
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
|
||||
let json_url = build_json_url(url);
|
||||
let resp = client.fetch(&json_url).await?;
|
||||
if resp.status == 404 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify_product: '{url}' not found (got 404 from {json_url})"
|
||||
)));
|
||||
}
|
||||
if resp.status == 403 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify_product: {json_url} returned 403 — the store has antibot in front of the .json endpoint. Try /v1/scrape/ecommerce_product for the HTML + JSON-LD fallback."
|
||||
)));
|
||||
}
|
||||
if resp.status != 200 {
|
||||
return Err(FetchError::Build(format!(
|
||||
"shopify returned status {} for {json_url}",
|
||||
resp.status
|
||||
)));
|
||||
}
|
||||
|
||||
let body: Wrapper = serde_json::from_str(&resp.html).map_err(|e| {
|
||||
FetchError::BodyDecode(format!(
|
||||
"shopify_product: '{url}' didn't return Shopify JSON — likely not a Shopify store ({e})"
|
||||
))
|
||||
})?;
|
||||
let p = body.product;
|
||||
|
||||
let variants: Vec<Value> = p
|
||||
.variants
|
||||
.iter()
|
||||
.map(|v| {
|
||||
json!({
|
||||
"id": v.id,
|
||||
"title": v.title,
|
||||
"sku": v.sku,
|
||||
"barcode": v.barcode,
|
||||
"price": v.price,
|
||||
"compare_at_price": v.compare_at_price,
|
||||
"available": v.available,
|
||||
"inventory_quantity": v.inventory_quantity,
|
||||
"position": v.position,
|
||||
"weight": v.weight,
|
||||
"weight_unit": v.weight_unit,
|
||||
"requires_shipping": v.requires_shipping,
|
||||
"taxable": v.taxable,
|
||||
"option1": v.option1,
|
||||
"option2": v.option2,
|
||||
"option3": v.option3,
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
let images: Vec<Value> = p
|
||||
.images
|
||||
.iter()
|
||||
.map(|i| {
|
||||
json!({
|
||||
"src": i.src,
|
||||
"width": i.width,
|
||||
"height": i.height,
|
||||
"position": i.position,
|
||||
"alt": i.alt,
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
|
||||
let options: Vec<Value> = p
|
||||
.options
|
||||
.iter()
|
||||
.map(|o| json!({"name": o.name, "values": o.values, "position": o.position}))
|
||||
.collect();
|
||||
|
||||
// Price range + availability summary across variants (the shape
|
||||
// agents typically want without walking the variants array).
|
||||
let prices: Vec<f64> = p
|
||||
.variants
|
||||
.iter()
|
||||
.filter_map(|v| v.price.as_deref().and_then(|s| s.parse::<f64>().ok()))
|
||||
.collect();
|
||||
let any_available = p.variants.iter().any(|v| v.available.unwrap_or(false));
|
||||
|
||||
Ok(json!({
|
||||
"url": url,
|
||||
"json_url": json_url,
|
||||
"product_id": p.id,
|
||||
"handle": p.handle,
|
||||
"title": p.title,
|
||||
"vendor": p.vendor,
|
||||
"product_type": p.product_type,
|
||||
"tags": p.tags,
|
||||
"description_html":p.body_html,
|
||||
"published_at": p.published_at,
|
||||
"created_at": p.created_at,
|
||||
"updated_at": p.updated_at,
|
||||
"variant_count": variants.len(),
|
||||
"image_count": images.len(),
|
||||
"any_available": any_available,
|
||||
"price_min": prices.iter().cloned().fold(f64::INFINITY, f64::min).is_finite().then(|| prices.iter().cloned().fold(f64::INFINITY, f64::min)),
|
||||
"price_max": prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max).is_finite().then(|| prices.iter().cloned().fold(f64::NEG_INFINITY, f64::max)),
|
||||
"variants": variants,
|
||||
"images": images,
|
||||
"options": options,
|
||||
}))
|
||||
}
|
||||
|
||||
/// Build the .json path from a product URL. Handles pre-.jsoned URLs,
|
||||
/// trailing slashes, and query strings.
|
||||
fn build_json_url(url: &str) -> String {
|
||||
let (path_part, query_part) = match url.split_once('?') {
|
||||
Some((a, b)) => (a, Some(b)),
|
||||
None => (url, None),
|
||||
};
|
||||
let clean = path_part.trim_end_matches('/');
|
||||
let with_json = if clean.ends_with(".json") {
|
||||
clean.to_string()
|
||||
} else {
|
||||
format!("{clean}.json")
|
||||
};
|
||||
match query_part {
|
||||
Some(q) => format!("{with_json}?{q}"),
|
||||
None => with_json,
|
||||
}
|
||||
}
|
||||
|
||||
fn host_of(url: &str) -> &str {
|
||||
url.split("://")
|
||||
.nth(1)
|
||||
.unwrap_or(url)
|
||||
.split('/')
|
||||
.next()
|
||||
.unwrap_or("")
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
// Shopify product JSON shape (a subset of the full response)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Wrapper {
    product: Product,
}

#[derive(Deserialize)]
struct Product {
    id: Option<i64>,
    title: Option<String>,
    handle: Option<String>,
    vendor: Option<String>,
    product_type: Option<String>,
    body_html: Option<String>,
    published_at: Option<String>,
    created_at: Option<String>,
    updated_at: Option<String>,
    #[serde(default)]
    tags: serde_json::Value, // array OR comma-joined string depending on store
    #[serde(default)]
    variants: Vec<Variant>,
    #[serde(default)]
    images: Vec<Image>,
    #[serde(default)]
    options: Vec<Option_>,
}

#[derive(Deserialize)]
struct Variant {
    id: Option<i64>,
    title: Option<String>,
    sku: Option<String>,
    barcode: Option<String>,
    price: Option<String>,
    compare_at_price: Option<String>,
    available: Option<bool>,
    inventory_quantity: Option<i64>,
    position: Option<i64>,
    weight: Option<f64>,
    weight_unit: Option<String>,
    requires_shipping: Option<bool>,
    taxable: Option<bool>,
    option1: Option<String>,
    option2: Option<String>,
    option3: Option<String>,
}

#[derive(Deserialize)]
struct Image {
    src: Option<String>,
    width: Option<i64>,
    height: Option<i64>,
    position: Option<i64>,
    alt: Option<String>,
}

#[derive(Deserialize)]
#[serde(rename_all = "lowercase")]
struct Option_ {
    name: Option<String>,
    position: Option<i64>,
    #[serde(default)]
    values: Vec<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_plausible_shopify_urls() {
        assert!(matches(
            "https://www.allbirds.com/products/mens-tree-runners"
        ));
        assert!(matches(
            "https://shop.example.com/products/cool-tshirt?variant=123"
        ));
        assert!(matches("https://somestore.myshopify.com/products/thing-1"));
    }

    #[test]
    fn rejects_known_non_shopify() {
        assert!(!matches("https://www.amazon.com/dp/B0C123"));
        assert!(!matches("https://www.etsy.com/listing/12345/foo"));
        assert!(!matches("https://www.amazon.co.uk/products/thing"));
        assert!(!matches("https://github.com/products"));
    }

    #[test]
    fn rejects_non_product_urls() {
        assert!(!matches("https://example.com/"));
        assert!(!matches("https://example.com/products/"));
        assert!(!matches("https://example.com/collections/all"));
    }

    #[test]
    fn build_json_url_handles_slash_and_query() {
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo"),
            "https://shop.example.com/products/foo.json"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo/"),
            "https://shop.example.com/products/foo.json"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo?variant=123"),
            "https://shop.example.com/products/foo.json?variant=123"
        );
        assert_eq!(
            build_json_url("https://shop.example.com/products/foo.json"),
            "https://shop.example.com/products/foo.json"
        );
    }
}

216  crates/webclaw-fetch/src/extractors/stackoverflow.rs  Normal file

@@ -0,0 +1,216 @@
//! Stack Overflow Q&A structured extractor.
//!
//! Uses the Stack Exchange API at `api.stackexchange.com/2.3/questions/{id}`
//! with `site=stackoverflow`. Two calls: one for the question, one for
//! its answers. Both come pre-filtered to include the rendered HTML body
//! so we don't re-parse the question page itself.
//!
//! Anonymous access caps at 300 requests per IP per day. Production
//! cloud should set `STACKAPPS_KEY` to lift the cap to 10,000/day, but
//! we don't require it to work out of the box.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "stackoverflow",
    label: "Stack Overflow Q&A",
    description: "Returns question + answers: title, body, tags, votes, accepted answer, top answers.",
    url_patterns: &["https://stackoverflow.com/questions/{id}/{slug}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host != "stackoverflow.com" && host != "www.stackoverflow.com" {
        return false;
    }
    parse_question_id(url).is_some()
}

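The quota note above can be made concrete. A hypothetical sketch of how an optional StackApps key would be appended to the question endpoint (the extractor in this file builds the URL without a `key` parameter; the name `question_api_url` is illustrative):

```rust
// Hypothetical: append the StackApps key when present to lift the
// anonymous 300/day quota to 10,000/day. The base URL matches the one
// used by the extractor; the key handling is a sketch.
fn question_api_url(id: u64, key: Option<&str>) -> String {
    let mut u = format!(
        "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody"
    );
    if let Some(k) = key {
        u.push_str("&key=");
        u.push_str(k);
    }
    u
}
```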
pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let id = parse_question_id(url).ok_or_else(|| {
        FetchError::Build(format!(
            "stackoverflow: cannot parse question id from '{url}'"
        ))
    })?;

    // Filter `withbody` includes the rendered HTML body for both questions
    // and answers. Stack Exchange's filter system is documented at
    // api.stackexchange.com/docs/filters.
    let q_url = format!(
        "https://api.stackexchange.com/2.3/questions/{id}?site=stackoverflow&filter=withbody"
    );
    let q_resp = client.fetch(&q_url).await?;
    if q_resp.status != 200 {
        return Err(FetchError::Build(format!(
            "stackexchange api returned status {}",
            q_resp.status
        )));
    }
    let q_body: QResponse = serde_json::from_str(&q_resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("stackoverflow q parse: {e}")))?;
    let q = q_body
        .items
        .first()
        .ok_or_else(|| FetchError::Build(format!("stackoverflow: question {id} not found")))?;

    let a_url = format!(
        "https://api.stackexchange.com/2.3/questions/{id}/answers?site=stackoverflow&filter=withbody&order=desc&sort=votes"
    );
    let a_resp = client.fetch(&a_url).await?;
    let answers = if a_resp.status == 200 {
        let a_body: AResponse = serde_json::from_str(&a_resp.html)
            .map_err(|e| FetchError::BodyDecode(format!("stackoverflow a parse: {e}")))?;
        a_body
            .items
            .iter()
            .map(|a| {
                json!({
                    "answer_id": a.answer_id,
                    "is_accepted": a.is_accepted,
                    "score": a.score,
                    "body": a.body,
                    "creation_date": a.creation_date,
                    "last_edit_date": a.last_edit_date,
                    "author": a.owner.as_ref().and_then(|o| o.display_name.clone()),
                    "author_rep": a.owner.as_ref().and_then(|o| o.reputation),
                })
            })
            .collect::<Vec<_>>()
    } else {
        Vec::new()
    };

    let accepted = answers
        .iter()
        .find(|a| {
            a.get("is_accepted")
                .and_then(|v| v.as_bool())
                .unwrap_or(false)
        })
        .cloned();

    Ok(json!({
        "url": url,
        "question_id": q.question_id,
        "title": q.title,
        "body": q.body,
        "tags": q.tags,
        "score": q.score,
        "view_count": q.view_count,
        "answer_count": q.answer_count,
        "is_answered": q.is_answered,
        "accepted_answer_id": q.accepted_answer_id,
        "creation_date": q.creation_date,
        "last_activity_date": q.last_activity_date,
        "author": q.owner.as_ref().and_then(|o| o.display_name.clone()),
        "author_rep": q.owner.as_ref().and_then(|o| o.reputation),
        "link": q.link,
        "accepted_answer": accepted,
        "top_answers": answers,
    }))
}

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Parse question id from a URL of the form `/questions/{id}/{slug}`.
fn parse_question_id(url: &str) -> Option<u64> {
    let after = url.split("/questions/").nth(1)?;
    let stripped = after.split(['?', '#']).next()?.trim_end_matches('/');
    let first = stripped.split('/').next()?;
    first.parse::<u64>().ok()
}

// ---------------------------------------------------------------------------
// Stack Exchange API types
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct QResponse {
    #[serde(default)]
    items: Vec<Question>,
}

#[derive(Deserialize)]
struct Question {
    question_id: Option<u64>,
    title: Option<String>,
    body: Option<String>,
    #[serde(default)]
    tags: Vec<String>,
    score: Option<i64>,
    view_count: Option<i64>,
    answer_count: Option<i64>,
    is_answered: Option<bool>,
    accepted_answer_id: Option<u64>,
    creation_date: Option<i64>,
    last_activity_date: Option<i64>,
    owner: Option<Owner>,
    link: Option<String>,
}

#[derive(Deserialize)]
struct AResponse {
    #[serde(default)]
    items: Vec<Answer>,
}

#[derive(Deserialize)]
struct Answer {
    answer_id: Option<u64>,
    is_accepted: Option<bool>,
    score: Option<i64>,
    body: Option<String>,
    creation_date: Option<i64>,
    last_edit_date: Option<i64>,
    owner: Option<Owner>,
}

#[derive(Deserialize)]
struct Owner {
    display_name: Option<String>,
    reputation: Option<i64>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_question_urls() {
        assert!(matches(
            "https://stackoverflow.com/questions/12345/some-slug"
        ));
        assert!(matches(
            "https://stackoverflow.com/questions/12345/some-slug?answertab=votes"
        ));
        assert!(!matches("https://stackoverflow.com/"));
        assert!(!matches("https://stackoverflow.com/questions"));
        assert!(!matches("https://stackoverflow.com/users/100"));
        assert!(!matches("https://example.com/questions/12345/x"));
    }

    #[test]
    fn parse_question_id_handles_slug_and_query() {
        assert_eq!(
            parse_question_id("https://stackoverflow.com/questions/12345/some-slug"),
            Some(12345)
        );
        assert_eq!(
            parse_question_id("https://stackoverflow.com/questions/12345/some-slug?tab=newest"),
            Some(12345)
        );
        assert_eq!(parse_question_id("https://stackoverflow.com/foo"), None);
    }
}

565  crates/webclaw-fetch/src/extractors/substack_post.rs  Normal file

@@ -0,0 +1,565 @@
//! Substack post extractor.
//!
//! Every Substack publication exposes `/api/v1/posts/{slug}`, which
//! returns the full post as JSON: body HTML, cover image, author,
//! publication info, reactions, paywall state. No auth on public
//! posts.
//!
//! Works on both `*.substack.com` subdomains and custom domains
//! (e.g. `simonwillison.net` uses Substack too). Detection is
//! "URL has `/p/{slug}`" because that's the canonical Substack post
//! path. Explicit-call only, because the `/p/{slug}` URL shape is
//! used by non-Substack sites too.
//!
//! ## Fallback
//!
//! The API endpoint is rate-limited aggressively on popular publications
//! and occasionally returns 403 on custom domains with Cloudflare in
//! front. When that happens we escalate to an HTML fetch (via
//! `smart_fetch_html`, so antibot-protected custom domains still work)
//! and extract OG tags + Article JSON-LD for a degraded-but-useful
//! payload. The response shape stays stable across both paths; a
//! `data_source` field tells the caller which branch ran.

use std::sync::OnceLock;

use regex::Regex;
use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "substack_post",
    label: "Substack post",
    description: "Returns post HTML, title, subtitle, author, publication, reactions, paywall status via the Substack public API. Falls back to OG + JSON-LD HTML parsing when the API is rate-limited.",
    url_patterns: &[
        "https://{pub}.substack.com/p/{slug}",
        "https://{custom-domain}/p/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return false;
    }
    url.contains("/p/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let slug = parse_slug(url).ok_or_else(|| {
        FetchError::Build(format!("substack_post: cannot parse slug from '{url}'"))
    })?;
    let host = host_of(url);
    if host.is_empty() {
        return Err(FetchError::Build(format!(
            "substack_post: empty host in '{url}'"
        )));
    }
    let scheme = if url.starts_with("http://") {
        "http"
    } else {
        "https"
    };
    let api_url = format!("{scheme}://{host}/api/v1/posts/{slug}");

    // 1. Try the public API. 200 = full payload; 404 = real miss; any
    //    other status hands off to the HTML fallback so a transient rate
    //    limit or a hardened custom domain doesn't fail the whole call.
    let resp = client.fetch(&api_url).await?;
    match resp.status {
        200 => match serde_json::from_str::<Post>(&resp.html) {
            Ok(p) => Ok(build_api_payload(url, &api_url, &slug, p)),
            Err(e) => {
                // API returned 200 but the body isn't the Post shape we
                // expect. Could be a custom-domain site that exposes
                // something else at /api/v1/posts/. Fall back to HTML
                // rather than hard-failing.
                html_fallback(
                    client,
                    url,
                    &api_url,
                    &slug,
                    Some(format!(
                        "api returned 200 but body was not Substack JSON ({e})"
                    )),
                )
                .await
            }
        },
        404 => Err(FetchError::Build(format!(
            "substack_post: '{slug}' not found on {host} (got 404). \
             If the publication isn't actually on Substack, use /v1/scrape instead."
        ))),
        _ => {
            // Rate limit, 403, 5xx, whatever: try HTML.
            let reason = format!("api returned status {} for {api_url}", resp.status);
            html_fallback(client, url, &api_url, &slug, Some(reason)).await
        }
    }
}

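The status dispatch in the extract function above reduces to a three-way routing policy. A minimal standalone sketch of that policy (the enum and function names are illustrative, not part of the crate):

```rust
// Sketch of the substack_post status policy: 200 parses the API payload,
// 404 is a hard "not a Substack post" error, and anything else (rate
// limit, 403 from Cloudflare, 5xx) escalates to the HTML fallback.
#[derive(Debug, PartialEq)]
enum Route {
    Api,          // 200: deserialize the Post payload
    NotFound,     // 404: the post really isn't there
    HtmlFallback, // everything else: degrade to OG + JSON-LD parsing
}

fn route(status: u16) -> Route {
    match status {
        200 => Route::Api,
        404 => Route::NotFound,
        _ => Route::HtmlFallback,
    }
}
```

Treating 404 as terminal while every other failure degrades keeps transient rate limits from surfacing as hard errors.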
// ---------------------------------------------------------------------------
// API-path payload builder
// ---------------------------------------------------------------------------

fn build_api_payload(url: &str, api_url: &str, slug: &str, p: Post) -> Value {
    json!({
        "url": url,
        "api_url": api_url,
        "data_source": "api",
        "id": p.id,
        "type": p.r#type,
        "slug": p.slug.or_else(|| Some(slug.to_string())),
        "title": p.title,
        "subtitle": p.subtitle,
        "description": p.description,
        "canonical_url": p.canonical_url,
        "post_date": p.post_date,
        "updated_at": p.updated_at,
        "audience": p.audience,
        "has_paywall": matches!(p.audience.as_deref(), Some("only_paid") | Some("founding")),
        "is_free_preview": p.is_free_preview,
        "cover_image": p.cover_image,
        "word_count": p.wordcount,
        "reactions": p.reactions,
        "comment_count": p.comment_count,
        "body_html": p.body_html,
        "body_text": p.truncated_body_text.or(p.body_text),
        "publication": json!({
            "id": p.publication.as_ref().and_then(|pub_| pub_.id),
            "name": p.publication.as_ref().and_then(|pub_| pub_.name.clone()),
            "subdomain": p.publication.as_ref().and_then(|pub_| pub_.subdomain.clone()),
            "custom_domain": p.publication.as_ref().and_then(|pub_| pub_.custom_domain.clone()),
        }),
        "authors": p.published_bylines.iter().map(|a| json!({
            "id": a.id,
            "name": a.name,
            "handle": a.handle,
            "photo": a.photo_url,
        })).collect::<Vec<_>>(),
    })
}

// ---------------------------------------------------------------------------
// HTML fallback: OG + Article JSON-LD
// ---------------------------------------------------------------------------

async fn html_fallback(
    client: &dyn Fetcher,
    url: &str,
    api_url: &str,
    slug: &str,
    fallback_reason: Option<String>,
) -> Result<Value, FetchError> {
    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse_html(&fetched.html, url, api_url, slug);
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "fetch_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
        if let Some(reason) = fallback_reason {
            obj.insert("fallback_reason".into(), json!(reason));
        }
    }
    Ok(data)
}

/// Pure HTML parser. Pulls title, subtitle, description, cover image,
/// publish date, and authors from OG tags and Article JSON-LD. Kept
/// public so tests can exercise it with fixtures.
pub fn parse_html(html: &str, url: &str, api_url: &str, slug: &str) -> Value {
    let article = find_article_jsonld(html);

    let title = article
        .as_ref()
        .and_then(|v| get_text(v, "headline"))
        .or_else(|| og(html, "title"));
    let description = article
        .as_ref()
        .and_then(|v| get_text(v, "description"))
        .or_else(|| og(html, "description"));
    let cover_image = article
        .as_ref()
        .and_then(get_first_image)
        .or_else(|| og(html, "image"));
    let post_date = article
        .as_ref()
        .and_then(|v| get_text(v, "datePublished"))
        .or_else(|| meta_property(html, "article:published_time"));
    let updated_at = article.as_ref().and_then(|v| get_text(v, "dateModified"));
    let publication_name = og(html, "site_name");
    let authors = article.as_ref().map(extract_authors).unwrap_or_default();

    json!({
        "url": url,
        "api_url": api_url,
        "data_source": "html_fallback",
        "slug": slug,
        "title": title,
        "subtitle": None::<String>,
        "description": description,
        "canonical_url": canonical_url(html).or_else(|| Some(url.to_string())),
        "post_date": post_date,
        "updated_at": updated_at,
        "cover_image": cover_image,
        "body_html": None::<String>,
        "body_text": None::<String>,
        "word_count": None::<i64>,
        "comment_count": None::<i64>,
        "reactions": Value::Null,
        "has_paywall": None::<bool>,
        "is_free_preview": None::<bool>,
        "publication": json!({
            "name": publication_name,
        }),
        "authors": authors,
    })
}

fn extract_authors(v: &Value) -> Vec<Value> {
    let Some(a) = v.get("author") else {
        return Vec::new();
    };
    let one = |val: &Value| -> Option<Value> {
        match val {
            Value::String(s) => Some(json!({"name": s})),
            Value::Object(_) => {
                let name = val.get("name").and_then(|n| n.as_str())?;
                let handle = val
                    .get("url")
                    .and_then(|u| u.as_str())
                    .and_then(handle_from_author_url);
                Some(json!({
                    "name": name,
                    "handle": handle,
                }))
            }
            _ => None,
        }
    };
    match a {
        Value::Array(arr) => arr.iter().filter_map(one).collect(),
        _ => one(a).into_iter().collect(),
    }
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

fn parse_slug(url: &str) -> Option<String> {
    let after = url.split("/p/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}

/// Extract the Substack handle from an author URL like
/// `https://substack.com/@handle` or `https://pub.substack.com/@handle`.
///
/// Returns `None` when the URL has no `@` segment (e.g. a non-Substack
/// author page) so we don't synthesise a fake handle.
fn handle_from_author_url(u: &str) -> Option<String> {
    let after = u.rsplit_once('@').map(|(_, tail)| tail)?;
    let clean = after.split(['/', '?', '#']).next()?;
    if clean.is_empty() {
        None
    } else {
        Some(clean.to_string())
    }
}

// ---------------------------------------------------------------------------
// HTML tag helpers
// ---------------------------------------------------------------------------

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

/// Pull `<meta property="article:published_time" content="...">` and
/// similar structured meta tags.
fn meta_property(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn canonical_url(html: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE
        .get_or_init(|| Regex::new(r#"(?i)<link[^>]+rel="canonical"[^>]+href="([^"]+)""#).unwrap());
    re.captures(html)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

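To make the meta-tag regexes above concrete: for a single well-formed tag in the `property`-before-`content` attribute order they expect, a std-only equivalent of the lookup looks like this (illustration only; the extractor uses the compiled `regex` patterns, and the function name here is hypothetical):

```rust
// Std-only sketch of what `og(html, prop)` matches for one well-formed
// tag: find `property="og:{prop}"`, then take the next `content="..."`
// value on that tag. No case-insensitivity or attribute reordering.
fn og_content(html: &str, prop: &str) -> Option<String> {
    let needle = format!("property=\"og:{prop}\"");
    let tag_start = html.find(&needle)?;
    let rest = &html[tag_start + needle.len()..];
    let c = rest.find("content=\"")? + "content=\"".len();
    let rest = &rest[c..];
    Some(rest[..rest.find('"')?].to_string())
}
```

The regex version additionally handles case variations and scans all matches, which is why the real helpers pay for a compiled pattern behind a `OnceLock`.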
// ---------------------------------------------------------------------------
// JSON-LD walkers (Article / NewsArticle)
// ---------------------------------------------------------------------------

fn find_article_jsonld(html: &str) -> Option<Value> {
    let blocks = webclaw_core::structured_data::extract_json_ld(html);
    for b in blocks {
        if let Some(found) = find_article_in(&b) {
            return Some(found);
        }
    }
    None
}

fn find_article_in(v: &Value) -> Option<Value> {
    if is_article_type(v) {
        return Some(v.clone());
    }
    if let Some(graph) = v.get("@graph").and_then(|g| g.as_array()) {
        for item in graph {
            if let Some(found) = find_article_in(item) {
                return Some(found);
            }
        }
    }
    if let Some(arr) = v.as_array() {
        for item in arr {
            if let Some(found) = find_article_in(item) {
                return Some(found);
            }
        }
    }
    None
}

fn is_article_type(v: &Value) -> bool {
    let Some(t) = v.get("@type") else {
        return false;
    };
    let is_art = |s: &str| {
        matches!(
            s,
            "Article" | "NewsArticle" | "BlogPosting" | "SocialMediaPosting"
        )
    };
    match t {
        Value::String(s) => is_art(s),
        Value::Array(arr) => arr.iter().any(|x| x.as_str().is_some_and(is_art)),
        _ => false,
    }
}

fn get_text(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| match x {
        Value::String(s) => Some(s.clone()),
        Value::Number(n) => Some(n.to_string()),
        _ => None,
    })
}

fn get_first_image(v: &Value) -> Option<String> {
    match v.get("image")? {
        Value::String(s) => Some(s.clone()),
        Value::Array(arr) => arr.iter().find_map(|x| match x {
            Value::String(s) => Some(s.clone()),
            Value::Object(_) => x.get("url").and_then(|u| u.as_str()).map(String::from),
            _ => None,
        }),
        Value::Object(o) => o.get("url").and_then(|u| u.as_str()).map(String::from),
        _ => None,
    }
}

fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

// ---------------------------------------------------------------------------
// Substack API types (subset)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Post {
    id: Option<i64>,
    r#type: Option<String>,
    slug: Option<String>,
    title: Option<String>,
    subtitle: Option<String>,
    description: Option<String>,
    canonical_url: Option<String>,
    post_date: Option<String>,
    updated_at: Option<String>,
    audience: Option<String>,
    is_free_preview: Option<bool>,
    cover_image: Option<String>,
    wordcount: Option<i64>,
    reactions: Option<serde_json::Value>,
    comment_count: Option<i64>,
    body_html: Option<String>,
    body_text: Option<String>,
    truncated_body_text: Option<String>,
    publication: Option<Publication>,
    #[serde(default, rename = "publishedBylines")]
    published_bylines: Vec<Byline>,
}

#[derive(Deserialize)]
struct Publication {
    id: Option<i64>,
    name: Option<String>,
    subdomain: Option<String>,
    custom_domain: Option<String>,
}

#[derive(Deserialize)]
struct Byline {
    id: Option<i64>,
    name: Option<String>,
    handle: Option<String>,
    photo_url: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_post_urls() {
        assert!(matches(
            "https://stratechery.substack.com/p/the-tech-letter"
        ));
        assert!(matches("https://simonwillison.net/p/2024-08-01-something"));
        assert!(!matches("https://example.com/"));
        assert!(!matches("ftp://example.com/p/foo"));
    }

    #[test]
    fn parse_slug_strips_query_and_trailing_slash() {
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post"),
            Some("my-post".into())
        );
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post/"),
            Some("my-post".into())
        );
        assert_eq!(
            parse_slug("https://example.substack.com/p/my-post?ref=123"),
            Some("my-post".into())
        );
    }

    #[test]
    fn parse_html_extracts_from_og_tags() {
        let html = r##"
            <html><head>
            <meta property="og:title" content="My Great Post">
            <meta property="og:description" content="A short summary.">
            <meta property="og:image" content="https://cdn.substack.com/cover.jpg">
            <meta property="og:site_name" content="My Publication">
            <meta property="article:published_time" content="2025-09-01T10:00:00Z">
            <link rel="canonical" href="https://mypub.substack.com/p/my-post">
            </head></html>"##;
        let v = parse_html(
            html,
            "https://mypub.substack.com/p/my-post",
            "https://mypub.substack.com/api/v1/posts/my-post",
            "my-post",
        );
        assert_eq!(v["data_source"], "html_fallback");
        assert_eq!(v["title"], "My Great Post");
        assert_eq!(v["description"], "A short summary.");
        assert_eq!(v["cover_image"], "https://cdn.substack.com/cover.jpg");
        assert_eq!(v["post_date"], "2025-09-01T10:00:00Z");
        assert_eq!(v["publication"]["name"], "My Publication");
        assert_eq!(v["canonical_url"], "https://mypub.substack.com/p/my-post");
    }

    #[test]
    fn parse_html_prefers_jsonld_when_present() {
        let html = r##"
            <html><head>
            <meta property="og:title" content="OG Title">
            <script type="application/ld+json">
            {"@context":"https://schema.org","@type":"NewsArticle",
             "headline":"JSON-LD Title",
             "description":"JSON-LD desc.",
             "image":"https://cdn.substack.com/hero.jpg",
             "datePublished":"2025-10-12T08:30:00Z",
             "dateModified":"2025-10-12T09:00:00Z",
             "author":[{"@type":"Person","name":"Alice Author","url":"https://substack.com/@alice"}]}
            </script>
            </head></html>"##;
        let v = parse_html(
            html,
            "https://example.com/p/a",
            "https://example.com/api/v1/posts/a",
            "a",
        );
        assert_eq!(v["title"], "JSON-LD Title");
        assert_eq!(v["description"], "JSON-LD desc.");
        assert_eq!(v["cover_image"], "https://cdn.substack.com/hero.jpg");
        assert_eq!(v["post_date"], "2025-10-12T08:30:00Z");
        assert_eq!(v["updated_at"], "2025-10-12T09:00:00Z");
        assert_eq!(v["authors"][0]["name"], "Alice Author");
        assert_eq!(v["authors"][0]["handle"], "alice");
    }

    #[test]
    fn handle_from_author_url_pulls_handle() {
        assert_eq!(
            handle_from_author_url("https://substack.com/@alice"),
            Some("alice".into())
        );
        assert_eq!(
            handle_from_author_url("https://mypub.substack.com/@bob/"),
            Some("bob".into())
        );
        assert_eq!(
            handle_from_author_url("https://not-substack.com/author/carol"),
            None
        );
    }
}

572  crates/webclaw-fetch/src/extractors/trustpilot_reviews.rs  Normal file

@@ -0,0 +1,572 @@
//! Trustpilot company reviews extractor.
//!
//! `trustpilot.com/review/{domain}` pages are always behind AWS WAF's
//! "Verifying your connection" interstitial, so this extractor always
//! routes through [`cloud::smart_fetch_html`]. Without
//! `WEBCLAW_API_KEY` / `WEBCLAW_CLOUD_API_KEY` it returns a clean
//! "set API key" error; with one it escalates to api.webclaw.io.
//!
//! ## 2025 JSON-LD schema
//!
//! Trustpilot replaced the old single-Organization + aggregateRating
//! shape with three separate JSON-LD blocks:
//!
//! 1. An `Organization` block for Trustpilot the platform itself
//!    (company info, addresses, social profiles). Not the business
//!    being reviewed. We detect and skip this.
//! 2. A `Dataset` block with a csvw:Table mainEntity that contains the
//!    per-star-bucket counts for the target business plus a Total
//!    column. The Dataset's `name` is the business display name.
//! 3. An `aiSummary` + `aiSummaryReviews` block: the AI-generated
//!    summary of reviews plus the individual review objects
//!    (consumer, dates, rating, title, text, language, likes).
//!
//! In addition, `metadata.title` from the page head parses as
//! `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"` and
//! `metadata.description` carries `"{N} customers have already said"`.
//! We use both as extra signal when the Dataset block is absent.
use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::cloud::{self, CloudError};
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "trustpilot_reviews",
    label: "Trustpilot reviews",
    description: "Returns business name, aggregate rating, star distribution, recent reviews, and the AI summary for a Trustpilot /review/{domain} page.",
    url_patterns: &["https://www.trustpilot.com/review/{domain}"],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if !matches!(host, "www.trustpilot.com" | "trustpilot.com") {
        return false;
    }
    url.contains("/review/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let fetched = cloud::smart_fetch_html(client, client.cloud(), url)
        .await
        .map_err(cloud_to_fetch_err)?;

    let mut data = parse(&fetched.html, url)?;
    if let Some(obj) = data.as_object_mut() {
        obj.insert(
            "data_source".into(),
            match fetched.source {
                cloud::FetchSource::Local => json!("local"),
                cloud::FetchSource::Cloud => json!("cloud"),
            },
        );
    }
    Ok(data)
}

/// Pure parser. Kept public so the cloud pipeline can reuse it on its
/// own fetched HTML without going through the async extract path.
pub fn parse(html: &str, url: &str) -> Result<Value, FetchError> {
    let domain = parse_review_domain(url).ok_or_else(|| {
        FetchError::Build(format!(
            "trustpilot_reviews: cannot parse /review/{{domain}} from '{url}'"
        ))
    })?;

    let blocks = webclaw_core::structured_data::extract_json_ld(html);

    // The business Dataset block has `about.@id` pointing to the target
    // domain's Organization (e.g. `.../Organization/anthropic.com`).
    let dataset = find_business_dataset(&blocks, &domain);

    // The aiSummary block: not typed (no `@type`), detect by key.
    let ai_block = find_ai_summary_block(&blocks);

    // Business name: Dataset > metadata.title regex > URL domain.
    let business_name = dataset
        .as_ref()
        .and_then(|d| get_string(d, "name"))
        .or_else(|| parse_name_from_og_title(html))
        .or_else(|| Some(domain.clone()));

    // Rating distribution from the csvw:Table columns. Each column has
    // a csvw:name like "1 star" / "Total" and a single cell with the
    // integer count.
    let distribution = dataset.as_ref().and_then(parse_star_distribution);
    let (rating_from_dist, total_from_dist) = distribution
        .as_ref()
        .map(compute_rating_stats)
        .unwrap_or((None, None));

    // Page-title / page-description fallbacks. OG title format:
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let (rating_label, rating_from_og) = parse_rating_from_og_title(html);
    let total_from_desc = parse_review_count_from_og_description(html);

    // Recent reviews carried by the aiSummary block.
    let recent_reviews: Vec<Value> = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummaryReviews"))
        .and_then(|arr| arr.as_array())
        .map(|arr| arr.iter().map(extract_review).collect())
        .unwrap_or_default();

    let ai_summary = ai_block
        .as_ref()
        .and_then(|a| a.get("aiSummary"))
        .and_then(|s| s.get("summary"))
        .and_then(|t| t.as_str())
        .map(String::from);

    // Take the length before `json!` moves `recent_reviews`.
    let review_count_listed = recent_reviews.len();

    Ok(json!({
        "url": url,
        "domain": domain,
        "business_name": business_name,
        "rating_label": rating_label,
        "average_rating": rating_from_dist.or(rating_from_og),
        "review_count": total_from_dist.or(total_from_desc),
        "rating_distribution": distribution,
        "ai_summary": ai_summary,
        "recent_reviews": recent_reviews,
        "review_count_listed": review_count_listed,
    }))
}
fn cloud_to_fetch_err(e: CloudError) -> FetchError {
    FetchError::Build(e.to_string())
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Pull the target domain from `trustpilot.com/review/{domain}`.
fn parse_review_domain(url: &str) -> Option<String> {
    let after = url.split("/review/").nth(1)?;
    let stripped = after
        .split(['?', '#'])
        .next()?
        .trim_end_matches('/')
        .split('/')
        .next()
        .unwrap_or("");
    if stripped.is_empty() {
        None
    } else {
        Some(stripped.to_string())
    }
}

// ---------------------------------------------------------------------------
// JSON-LD block walkers
// ---------------------------------------------------------------------------

/// Find the Dataset block whose `about.@id` references the target
/// domain's Organization. Falls through to any Dataset if the @id
/// check doesn't match (Trustpilot occasionally varies the URL).
fn find_business_dataset(blocks: &[Value], domain: &str) -> Option<Value> {
    let mut fallback_any_dataset: Option<Value> = None;
    for block in blocks {
        for node in walk_graph(block) {
            if !is_dataset(&node) {
                continue;
            }
            if dataset_about_matches_domain(&node, domain) {
                return Some(node);
            }
            if fallback_any_dataset.is_none() {
                fallback_any_dataset = Some(node);
            }
        }
    }
    fallback_any_dataset
}

fn is_dataset(v: &Value) -> bool {
    v.get("@type")
        .and_then(|t| t.as_str())
        .is_some_and(|s| s == "Dataset")
}

fn dataset_about_matches_domain(v: &Value, domain: &str) -> bool {
    let about_id = v
        .get("about")
        .and_then(|a| a.get("@id"))
        .and_then(|id| id.as_str());
    let Some(id) = about_id else {
        return false;
    };
    id.contains(&format!("/Organization/{domain}"))
}

/// The aiSummary / aiSummaryReviews block has no `@type`, so match by
/// presence of the `aiSummary` key.
fn find_ai_summary_block(blocks: &[Value]) -> Option<Value> {
    for block in blocks {
        for node in walk_graph(block) {
            if node.get("aiSummary").is_some() {
                return Some(node);
            }
        }
    }
    None
}

/// Flatten each block (and its `@graph`) into a list of nodes we can
/// iterate over. Handles both `@graph: [ ... ]` (array) and
/// `@graph: { ... }` (single object) shapes; Trustpilot uses both.
fn walk_graph(block: &Value) -> Vec<Value> {
    let mut out = vec![block.clone()];
    if let Some(graph) = block.get("@graph") {
        match graph {
            Value::Array(arr) => out.extend(arr.iter().cloned()),
            Value::Object(_) => out.push(graph.clone()),
            _ => {}
        }
    }
    out
}
// ---------------------------------------------------------------------------
// Rating distribution (csvw:Table)
// ---------------------------------------------------------------------------

/// Parse the per-star distribution from the Dataset block. Returns
/// `{"one_star": {count, percent}, ..., "total": {count, percent}}`.
fn parse_star_distribution(dataset: &Value) -> Option<Value> {
    let columns = dataset
        .get("mainEntity")?
        .get("csvw:tableSchema")?
        .get("csvw:columns")?
        .as_array()?;
    let mut out = serde_json::Map::new();
    for col in columns {
        let name = col.get("csvw:name").and_then(|n| n.as_str())?;
        let cell = col.get("csvw:cells").and_then(|c| c.as_array())?.first()?;
        let count = cell
            .get("csvw:value")
            .and_then(|v| v.as_str())
            .and_then(|s| s.parse::<i64>().ok());
        let percent = cell
            .get("csvw:notes")
            .and_then(|n| n.as_array())
            .and_then(|arr| arr.first())
            .and_then(|s| s.as_str())
            .map(String::from);
        let key = normalise_star_key(name);
        out.insert(
            key,
            json!({
                "count": count,
                "percent": percent,
            }),
        );
    }
    if out.is_empty() {
        None
    } else {
        Some(Value::Object(out))
    }
}

/// "1 star" -> "one_star", "Total" -> "total". Easier to consume than
/// the raw "1 star" key, which fights YAML/JS property access.
fn normalise_star_key(name: &str) -> String {
    let trimmed = name.trim().to_lowercase();
    match trimmed.as_str() {
        "1 star" => "one_star".into(),
        "2 stars" => "two_stars".into(),
        "3 stars" => "three_stars".into(),
        "4 stars" => "four_stars".into(),
        "5 stars" => "five_stars".into(),
        "total" => "total".into(),
        other => other.replace(' ', "_"),
    }
}

/// Compute the average rating (weighted by bucket) and total count from
/// the parsed distribution. Returns `(average, total)`.
fn compute_rating_stats(distribution: &Value) -> (Option<String>, Option<i64>) {
    let Some(obj) = distribution.as_object() else {
        return (None, None);
    };
    let get_count = |key: &str| -> i64 {
        obj.get(key)
            .and_then(|v| v.get("count"))
            .and_then(|v| v.as_i64())
            .unwrap_or(0)
    };
    let one = get_count("one_star");
    let two = get_count("two_stars");
    let three = get_count("three_stars");
    let four = get_count("four_stars");
    let five = get_count("five_stars");
    let total_bucket = one + two + three + four + five;
    let total = obj
        .get("total")
        .and_then(|v| v.get("count"))
        .and_then(|v| v.as_i64())
        .unwrap_or(total_bucket);
    if total == 0 {
        return (None, Some(0));
    }
    let weighted = one + (two * 2) + (three * 3) + (four * 4) + (five * 5);
    let avg = weighted as f64 / total_bucket.max(1) as f64;
    // One decimal place, matching how Trustpilot displays the score.
    (Some(format!("{avg:.1}")), Some(total))
}
// ---------------------------------------------------------------------------
// OG / meta-tag fallbacks
// ---------------------------------------------------------------------------

/// Regex out the business name from the standard Trustpilot OG title
/// shape: `"{name} is rated \"{label}\" with {rating} / 5 on Trustpilot"`.
fn parse_name_from_og_title(html: &str) -> Option<String> {
    let title = og(html, "title")?;
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"^(.+?)\s+is rated\b").unwrap());
    re.captures(&title)
        .and_then(|c| c.get(1))
        .map(|m| m.as_str().to_string())
}

/// Pull the rating label (e.g. "Bad", "Excellent") and numeric value
/// from the OG title.
fn parse_rating_from_og_title(html: &str) -> (Option<String>, Option<String>) {
    let Some(title) = og(html, "title") else {
        return (None, None);
    };
    static RE: OnceLock<Regex> = OnceLock::new();
    // "Anthropic is rated \"Bad\" with 1.5 / 5 on Trustpilot"
    let re = RE.get_or_init(|| {
        Regex::new(r#"is rated\s*[\\"]+([^"\\]+)[\\"]+\s*with\s*([\d.]+)\s*/\s*5"#).unwrap()
    });
    let Some(caps) = re.captures(&title) else {
        return (None, None);
    };
    (
        caps.get(1).map(|m| m.as_str().trim().to_string()),
        caps.get(2).map(|m| m.as_str().to_string()),
    )
}

/// Parse "hear what 226 customers have already said" from the OG
/// description tag.
fn parse_review_count_from_og_description(html: &str) -> Option<i64> {
    let desc = og(html, "description")?;
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| Regex::new(r"(\d[\d,]*)\s+customers").unwrap());
    re.captures(&desc)?
        .get(1)?
        .as_str()
        .replace(',', "")
        .parse::<i64>()
        .ok()
}

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            let raw = c.get(2).map(|m| m.as_str())?;
            return Some(html_unescape(raw));
        }
    }
    None
}

/// Minimal HTML entity unescaping for the handful of entities the
/// synthesize_html escaper might produce. Keeps us off a heavier dep.
fn html_unescape(s: &str) -> String {
    s.replace("&quot;", "\"")
        .replace("&amp;", "&")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
}

fn get_string(v: &Value, key: &str) -> Option<String> {
    v.get(key).and_then(|x| x.as_str().map(String::from))
}
// ---------------------------------------------------------------------------
// Review extraction
// ---------------------------------------------------------------------------

fn extract_review(r: &Value) -> Value {
    json!({
        "id": r.get("id").and_then(|v| v.as_str()),
        "rating": r.get("rating").and_then(|v| v.as_i64()),
        "title": r.get("title").and_then(|v| v.as_str()),
        "text": r.get("text").and_then(|v| v.as_str()),
        "language": r.get("language").and_then(|v| v.as_str()),
        "source": r.get("source").and_then(|v| v.as_str()),
        "likes": r.get("likes").and_then(|v| v.as_i64()),
        "author": r.get("consumer").and_then(|c| c.get("displayName")).and_then(|v| v.as_str()),
        "author_country": r.get("consumer").and_then(|c| c.get("countryCode")).and_then(|v| v.as_str()),
        "author_review_count": r.get("consumer").and_then(|c| c.get("numberOfReviews")).and_then(|v| v.as_i64()),
        "verified": r.get("consumer").and_then(|c| c.get("isVerified")).and_then(|v| v.as_bool()),
        "date_experienced": r.get("dates").and_then(|d| d.get("experiencedDate")).and_then(|v| v.as_str()),
        "date_published": r.get("dates").and_then(|d| d.get("publishedDate")).and_then(|v| v.as_str()),
    })
}

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_trustpilot_review_urls() {
        assert!(matches("https://www.trustpilot.com/review/stripe.com"));
        assert!(matches("https://trustpilot.com/review/example.com"));
        assert!(!matches("https://www.trustpilot.com/"));
        assert!(!matches("https://example.com/review/foo"));
    }

    #[test]
    fn parse_review_domain_handles_query_and_slash() {
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com/"),
            Some("anthropic.com".into())
        );
        assert_eq!(
            parse_review_domain("https://www.trustpilot.com/review/anthropic.com?stars=5"),
            Some("anthropic.com".into())
        );
    }

    #[test]
    fn normalise_star_key_covers_all_buckets() {
        assert_eq!(normalise_star_key("1 star"), "one_star");
        assert_eq!(normalise_star_key("2 stars"), "two_stars");
        assert_eq!(normalise_star_key("5 stars"), "five_stars");
        assert_eq!(normalise_star_key("Total"), "total");
    }

    #[test]
    fn compute_rating_stats_weighted_average() {
        // 100 1-stars, 100 5-stars → avg 3.0 over 200 reviews.
        let dist = json!({
            "one_star":    { "count": 100, "percent": "50%" },
            "two_stars":   { "count": 0,   "percent": "0%" },
            "three_stars": { "count": 0,   "percent": "0%" },
            "four_stars":  { "count": 0,   "percent": "0%" },
            "five_stars":  { "count": 100, "percent": "50%" },
            "total":       { "count": 200, "percent": "100%" },
        });
        let (avg, total) = compute_rating_stats(&dist);
        assert_eq!(avg.as_deref(), Some("3.0"));
        assert_eq!(total, Some(200));
    }

    #[test]
    fn parse_og_title_extracts_name_and_rating() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">"#;
        assert_eq!(parse_name_from_og_title(html), Some("Anthropic".into()));
        let (label, rating) = parse_rating_from_og_title(html);
        assert_eq!(label.as_deref(), Some("Bad"));
        assert_eq!(rating.as_deref(), Some("1.5"));
    }

    #[test]
    fn parse_review_count_from_og_description_picks_number() {
        let html = r#"<meta property="og:description" content="Do you agree? Voice your opinion today and hear what 226 customers have already said.">"#;
        assert_eq!(parse_review_count_from_og_description(html), Some(226));
    }

    #[test]
    fn parse_full_fixture_assembles_all_fields() {
        let html = r##"<html><head>
<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">
<script type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@id":"https://www.trustpilot.com/#/schema/Organization/1","@type":"Organization","name":"Trustpilot"}
]}
</script>
<script type="application/ld+json">
{"@context":["https://schema.org",{"csvw":"http://www.w3.org/ns/csvw#"}],
"@graph":{"@id":"https://www.trustpilot.com/#/schema/DataSet/anthropic.com/1",
"@type":"Dataset",
"about":{"@id":"https://www.trustpilot.com/#/schema/Organization/anthropic.com"},
"name":"Anthropic",
"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[
{"csvw:name":"1 star","csvw:cells":[{"csvw:value":"196","csvw:notes":["87%"]}]},
{"csvw:name":"2 stars","csvw:cells":[{"csvw:value":"9","csvw:notes":["4%"]}]},
{"csvw:name":"3 stars","csvw:cells":[{"csvw:value":"5","csvw:notes":["2%"]}]},
{"csvw:name":"4 stars","csvw:cells":[{"csvw:value":"1","csvw:notes":["0%"]}]},
{"csvw:name":"5 stars","csvw:cells":[{"csvw:value":"15","csvw:notes":["7%"]}]},
{"csvw:name":"Total","csvw:cells":[{"csvw:value":"226","csvw:notes":["100%"]}]}
]}}}}
</script>
<script type="application/ld+json">
{"aiSummary":{"modelVersion":"2.0.0","summary":"Mixed reviews."},
"aiSummaryReviews":[
{"id":"abc","rating":1,"title":"Bad","text":"Didn't work.","language":"en",
"source":"Organic","likes":2,"consumer":{"displayName":"W.FRH","countryCode":"DE","numberOfReviews":69,"isVerified":false},
"dates":{"experiencedDate":"2026-01-05T00:00:00.000Z","publishedDate":"2026-01-05T16:29:31.000Z"}}]}
</script>
</head></html>"##;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["rating_label"], "Bad");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_distribution"]["one_star"]["count"], 196);
        assert_eq!(v["rating_distribution"]["total"]["count"], 226);
        assert_eq!(v["ai_summary"], "Mixed reviews.");
        assert_eq!(v["recent_reviews"].as_array().unwrap().len(), 1);
        assert_eq!(v["recent_reviews"][0]["author"], "W.FRH");
        assert_eq!(v["recent_reviews"][0]["rating"], 1);
        assert_eq!(v["recent_reviews"][0]["title"], "Bad");
    }

    #[test]
    fn parse_falls_back_to_og_when_no_jsonld() {
        let html = r#"<meta property="og:title" content="Anthropic is rated &quot;Bad&quot; with 1.5 / 5 on Trustpilot">
<meta property="og:description" content="Voice your opinion today and hear what 226 customers have already said.">"#;
        let v = parse(html, "https://www.trustpilot.com/review/anthropic.com").unwrap();
        assert_eq!(v["domain"], "anthropic.com");
        assert_eq!(v["business_name"], "Anthropic");
        assert_eq!(v["average_rating"], "1.5");
        assert_eq!(v["review_count"], 226);
        assert_eq!(v["rating_label"], "Bad");
    }

    #[test]
    fn parse_returns_ok_with_url_domain_when_nothing_else() {
        let v = parse(
            "<html><head></head></html>",
            "https://www.trustpilot.com/review/example.com",
        )
        .unwrap();
        assert_eq!(v["domain"], "example.com");
        assert_eq!(v["business_name"], "example.com");
    }
}
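The weighted-average step in `compute_rating_stats` is easy to sanity-check in isolation. Below is a minimal standalone sketch of just that arithmetic; the bucket counts are sample data mirroring the test fixture, and `weighted_average` is a hypothetical helper name, not part of the crate:

```rust
// Standalone sketch of the star-bucket weighted average used by
// compute_rating_stats. Bucket counts here are sample data only.
fn weighted_average(buckets: [i64; 5]) -> f64 {
    // buckets[i] holds the count of (i + 1)-star reviews.
    let total: i64 = buckets.iter().sum();
    let weighted: i64 = buckets
        .iter()
        .enumerate()
        .map(|(i, count)| (i as i64 + 1) * count)
        .sum();
    // Guard against an empty distribution, as the real code does
    // with total_bucket.max(1).
    weighted as f64 / total.max(1) as f64
}

fn main() {
    // 196 one-star ... 15 five-star, like the fixture's csvw columns.
    let avg = weighted_average([196, 9, 5, 1, 15]);
    println!("{avg:.1}");
}
```

Note the sketch averages over the sum of the five buckets, not the reported `Total` column, matching how the extractor divides by `total_bucket` even when the Dataset's Total disagrees.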
crates/webclaw-fetch/src/extractors/woocommerce_product.rs (new file, 237 lines)
@@ -0,0 +1,237 @@
//! WooCommerce product structured extractor.
//!
//! Targets WooCommerce's Store API: `/wp-json/wc/store/v1/products?slug={slug}`.
//! Roughly 30-50% of WooCommerce stores expose this endpoint publicly
//! (it's on by default, but common security plugins disable it).
//! When it's off, the server returns 404 at /wp-json. We surface a
//! clean error and point callers at `/v1/scrape/ecommerce_product`,
//! which works on any store with Schema.org JSON-LD.
//!
//! Explicit-call only. `/product/{slug}` is the default permalink for
//! WooCommerce, but custom stores use every variation imaginable, so
//! auto-dispatch is unreliable.

use serde::Deserialize;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "woocommerce_product",
    label: "WooCommerce product",
    description: "Returns product data via the WooCommerce Store REST API (requires the /wp-json/wc/store endpoint to be enabled on the target store).",
    url_patterns: &[
        "https://{shop}/product/{slug}",
        "https://{shop}/shop/{slug}",
    ],
};

pub fn matches(url: &str) -> bool {
    let host = host_of(url);
    if host.is_empty() {
        return false;
    }
    // Permissive: WooCommerce stores use custom domains + custom
    // permalinks. The extractor's API probe is what confirms it's
    // really WooCommerce.
    url.contains("/product/")
        || url.contains("/shop/")
        || url.contains("/producto/") // common es locale
        || url.contains("/produit/") // common fr locale
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let slug = parse_slug(url).ok_or_else(|| {
        FetchError::Build(format!(
            "woocommerce_product: cannot parse slug from '{url}'"
        ))
    })?;
    let host = host_of(url);
    if host.is_empty() {
        return Err(FetchError::Build(format!(
            "woocommerce_product: empty host in '{url}'"
        )));
    }
    let scheme = if url.starts_with("http://") {
        "http"
    } else {
        "https"
    };
    let api_url = format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1");
    let resp = client.fetch(&api_url).await?;
    if resp.status == 404 {
        return Err(FetchError::Build(format!(
            "woocommerce_product: {host} does not expose /wp-json/wc/store (404). \
             Use /v1/scrape/ecommerce_product for the JSON-LD fallback."
        )));
    }
    if resp.status == 401 || resp.status == 403 {
        return Err(FetchError::Build(format!(
            "woocommerce_product: {host} requires auth for /wp-json/wc/store ({}). \
             Use /v1/scrape/ecommerce_product for the public JSON-LD fallback.",
            resp.status
        )));
    }
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "woocommerce api returned status {} for {api_url}",
            resp.status
        )));
    }

    let products: Vec<Product> = serde_json::from_str(&resp.html)
        .map_err(|e| FetchError::BodyDecode(format!("woocommerce parse: {e}")))?;
    let p = products.into_iter().next().ok_or_else(|| {
        FetchError::Build(format!(
            "woocommerce_product: no product found for slug '{slug}' on {host}"
        ))
    })?;

    let images: Vec<Value> = p
        .images
        .iter()
        .map(|i| json!({"src": i.src, "thumbnail": i.thumbnail, "alt": i.alt}))
        .collect();
    let variations_count = p.variations.as_ref().map(|v| v.len()).unwrap_or(0);

    Ok(json!({
        "url": url,
        "api_url": api_url,
        "product_id": p.id,
        "name": p.name,
        "slug": p.slug,
        "sku": p.sku,
        "permalink": p.permalink,
        "on_sale": p.on_sale,
        "in_stock": p.is_in_stock,
        "is_purchasable": p.is_purchasable,
        "price": p.prices.as_ref().and_then(|pr| pr.price.clone()),
        "regular_price": p.prices.as_ref().and_then(|pr| pr.regular_price.clone()),
        "sale_price": p.prices.as_ref().and_then(|pr| pr.sale_price.clone()),
        "currency": p.prices.as_ref().and_then(|pr| pr.currency_code.clone()),
        "currency_minor": p.prices.as_ref().and_then(|pr| pr.currency_minor_unit),
        "price_range": p.prices.as_ref().and_then(|pr| pr.price_range.clone()),
        "average_rating": p.average_rating,
        "review_count": p.review_count,
        "description": p.description,
        "short_description": p.short_description,
        "categories": p.categories.iter().filter_map(|c| c.name.clone()).collect::<Vec<_>>(),
        "tags": p.tags.iter().filter_map(|t| t.name.clone()).collect::<Vec<_>>(),
        "variation_count": variations_count,
        "image_count": images.len(),
        "images": images,
    }))
}
// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn host_of(url: &str) -> &str {
    url.split("://")
        .nth(1)
        .unwrap_or(url)
        .split('/')
        .next()
        .unwrap_or("")
}

/// Extract the product slug from common WooCommerce permalinks.
fn parse_slug(url: &str) -> Option<String> {
    for needle in ["/product/", "/shop/", "/producto/", "/produit/"] {
        if let Some(after) = url.split(needle).nth(1) {
            let stripped = after
                .split(['?', '#'])
                .next()?
                .trim_end_matches('/')
                .split('/')
                .next()
                .unwrap_or("");
            if !stripped.is_empty() {
                return Some(stripped.to_string());
            }
        }
    }
    None
}

// ---------------------------------------------------------------------------
// Store API types (subset of the full response)
// ---------------------------------------------------------------------------

#[derive(Deserialize)]
struct Product {
    id: Option<i64>,
    name: Option<String>,
    slug: Option<String>,
    sku: Option<String>,
    permalink: Option<String>,
    description: Option<String>,
    short_description: Option<String>,
    on_sale: Option<bool>,
    is_in_stock: Option<bool>,
    is_purchasable: Option<bool>,
    average_rating: Option<serde_json::Value>, // string or number
    review_count: Option<i64>,
    prices: Option<Prices>,
    #[serde(default)]
    categories: Vec<Term>,
    #[serde(default)]
    tags: Vec<Term>,
    #[serde(default)]
    images: Vec<Img>,
    variations: Option<Vec<serde_json::Value>>,
}

#[derive(Deserialize)]
struct Prices {
    price: Option<String>,
    regular_price: Option<String>,
    sale_price: Option<String>,
    currency_code: Option<String>,
    currency_minor_unit: Option<i64>,
    price_range: Option<serde_json::Value>,
}

#[derive(Deserialize)]
struct Term {
    name: Option<String>,
}

#[derive(Deserialize)]
struct Img {
    src: Option<String>,
    thumbnail: Option<String>,
    alt: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_common_permalinks() {
        assert!(matches("https://shop.example.com/product/cool-widget"));
        assert!(matches("https://shop.example.com/shop/cool-widget"));
        assert!(matches("https://tienda.example.com/producto/cosa"));
        assert!(matches("https://boutique.example.com/produit/chose"));
    }

    #[test]
    fn parse_slug_handles_locale_and_suffix() {
        assert_eq!(
            parse_slug("https://shop.example.com/product/cool-widget"),
            Some("cool-widget".into())
        );
        assert_eq!(
            parse_slug("https://shop.example.com/product/cool-widget/?attr=red"),
            Some("cool-widget".into())
        );
        assert_eq!(
            parse_slug("https://tienda.example.com/producto/cosa/"),
            Some("cosa".into())
        );
    }
}
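The Store API probe URL that `extract` assembles can be sketched on its own. This is a minimal standalone version of that `format!` call; the host and slug values are hypothetical examples, and `store_api_url` is an illustrative helper name, not a function in the crate:

```rust
// Sketch of the Store API probe URL built inside extract().
// Host and slug below are made-up example values.
fn store_api_url(scheme: &str, host: &str, slug: &str) -> String {
    // per_page=1: only the first matching product is consumed.
    format!("{scheme}://{host}/wp-json/wc/store/v1/products?slug={slug}&per_page=1")
}

fn main() {
    println!("{}", store_api_url("https", "shop.example.com", "cool-widget"));
    // → https://shop.example.com/wp-json/wc/store/v1/products?slug=cool-widget&per_page=1
}
```

A 200 from this endpoint with a non-empty JSON array is what confirms the target really is a WooCommerce store; the permissive URL `matches` check alone is not enough.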
crates/webclaw-fetch/src/extractors/youtube_video.rs (new file, 378 lines)
@@ -0,0 +1,378 @@
//! YouTube video structured extractor.
//!
//! YouTube embeds the full player configuration in a
//! `ytInitialPlayerResponse` JavaScript assignment at the top of
//! every `/watch`, `/shorts`, and `youtu.be` HTML page. We reuse the
//! core crate's already-proven regex + parse to surface typed JSON
//! from it: video id, title, author + channel id, view count,
//! duration, upload date, keywords, thumbnails, caption-track URLs.
//!
//! Auto-dispatched: the YouTube host is unique and the `v=` or `/shorts/`
//! shape is stable.
//!
//! ## Fallback
//!
//! `ytInitialPlayerResponse` is missing on EU-consent interstitials,
//! some live-stream pre-show pages, and age-gated videos. In those
//! cases we drop down to OG tags for `title`, `description`,
//! `thumbnail`, and `channel`, and return a `data_source:
//! "og_fallback"` payload so the caller can tell they got a degraded
//! shape (no view count, duration, or captions).

use std::sync::OnceLock;

use regex::Regex;
use serde_json::{Value, json};

use super::ExtractorInfo;
use crate::error::FetchError;
use crate::fetcher::Fetcher;

pub const INFO: ExtractorInfo = ExtractorInfo {
    name: "youtube_video",
    label: "YouTube video",
    description: "Returns video id, title, channel, view count, duration, upload date, thumbnails, keywords, and caption-track URLs. Falls back to OG metadata on consent / age-gate pages.",
    url_patterns: &[
        "https://www.youtube.com/watch?v={id}",
        "https://youtu.be/{id}",
        "https://www.youtube.com/shorts/{id}",
    ],
};

pub fn matches(url: &str) -> bool {
    webclaw_core::youtube::is_youtube_url(url)
        || url.contains("youtube.com/shorts/")
        || url.contains("youtube-nocookie.com/embed/")
}

pub async fn extract(client: &dyn Fetcher, url: &str) -> Result<Value, FetchError> {
    let video_id = parse_video_id(url).ok_or_else(|| {
        FetchError::Build(format!("youtube_video: cannot parse video id from '{url}'"))
    })?;

    // Always fetch the canonical /watch URL. /shorts/ and youtu.be
    // sometimes serve a thinner page without the player blob.
    let canonical = format!("https://www.youtube.com/watch?v={video_id}");
    let resp = client.fetch(&canonical).await?;
    if resp.status != 200 {
        return Err(FetchError::Build(format!(
            "youtube returned status {} for {canonical}",
            resp.status
        )));
    }

    if let Some(player) = extract_player_response(&resp.html) {
        return Ok(build_player_payload(
            &player, &resp.html, url, &canonical, &video_id,
        ));
    }

    // No player blob. Fall back to OG tags so the call still returns
    // something useful for consent / age-gate pages.
    Ok(build_og_fallback(&resp.html, url, &canonical, &video_id))
}

// ---------------------------------------------------------------------------
// Player-blob path (rich payload)
// ---------------------------------------------------------------------------

fn build_player_payload(
    player: &Value,
    html: &str,
    url: &str,
    canonical: &str,
    video_id: &str,
) -> Value {
    let video_details = player.get("videoDetails");
    let microformat = player
        .get("microformat")
        .and_then(|m| m.get("playerMicroformatRenderer"));

    let thumbnails: Vec<Value> = video_details
        .and_then(|vd| vd.get("thumbnail"))
        .and_then(|t| t.get("thumbnails"))
        .and_then(|t| t.as_array())
        .cloned()
        .unwrap_or_default();

    let keywords: Vec<Value> = video_details
        .and_then(|vd| vd.get("keywords"))
        .and_then(|k| k.as_array())
        .cloned()
        .unwrap_or_default();

    let caption_tracks = webclaw_core::youtube::extract_caption_tracks(html);
    let captions: Vec<Value> = caption_tracks
        .iter()
        .map(|c| {
            json!({
                "url": c.url,
                "lang": c.lang,
                "name": c.name,
            })
        })
        .collect();

    json!({
        "url": url,
        "canonical_url": canonical,
        "data_source": "player_response",
        "video_id": video_id,
        "title": get_str(video_details, "title"),
        "description": get_str(video_details, "shortDescription"),
        "author": get_str(video_details, "author"),
        "channel_id": get_str(video_details, "channelId"),
        "channel_url": get_str(microformat, "ownerProfileUrl"),
        "view_count": get_int(video_details, "viewCount"),
        "length_seconds": get_int(video_details, "lengthSeconds"),
        "is_live": video_details.and_then(|vd| vd.get("isLiveContent")).and_then(|v| v.as_bool()),
        "is_private": video_details.and_then(|vd| vd.get("isPrivate")).and_then(|v| v.as_bool()),
        "is_unlisted": microformat.and_then(|m| m.get("isUnlisted")).and_then(|v| v.as_bool()),
        "allow_ratings": video_details.and_then(|vd| vd.get("allowRatings")).and_then(|v| v.as_bool()),
        "category": get_str(microformat, "category"),
        "upload_date": get_str(microformat, "uploadDate"),
        "publish_date": get_str(microformat, "publishDate"),
        "keywords": keywords,
        "thumbnails": thumbnails,
        "caption_tracks": captions,
    })
}

// ---------------------------------------------------------------------------
// OG fallback path (degraded payload)
// ---------------------------------------------------------------------------

fn build_og_fallback(html: &str, url: &str, canonical: &str, video_id: &str) -> Value {
    let title = og(html, "title");
    let description = og(html, "description");
    let thumbnail = og(html, "image");
    // YouTube sets `<meta name="channel_name" ...>` on some pages, but
    // OG-only pages reliably carry `og:video:tag` and the channel in
    // `<link itemprop="name">`. We keep this lean: just what's stable.
    let channel = meta_name(html, "author");

    json!({
        "url": url,
        "canonical_url": canonical,
        "data_source": "og_fallback",
        "video_id": video_id,
        "title": title,
        "description": description,
        "author": channel,
        // OG path: these are null so the caller doesn't have to guess.
        "channel_id": None::<String>,
        "channel_url": None::<String>,
        "view_count": None::<i64>,
        "length_seconds": None::<i64>,
        "is_live": None::<bool>,
        "is_private": None::<bool>,
        "is_unlisted": None::<bool>,
        "allow_ratings": None::<bool>,
        "category": None::<String>,
        "upload_date": None::<String>,
        "publish_date": None::<String>,
        "keywords": Vec::<Value>::new(),
        "thumbnails": thumbnail.as_ref().map(|t| vec![json!({"url": t})]).unwrap_or_default(),
        "caption_tracks": Vec::<Value>::new(),
    })
}

// ---------------------------------------------------------------------------
// URL helpers
// ---------------------------------------------------------------------------

fn parse_video_id(url: &str) -> Option<String> {
    // youtu.be/{id}
    if let Some(after) = url.split("youtu.be/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube.com/shorts/{id}
    if let Some(after) = url.split("youtube.com/shorts/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube-nocookie.com/embed/{id}
    if let Some(after) = url.split("/embed/").nth(1) {
        let id = after
            .split(['?', '#', '/'])
            .next()
            .unwrap_or("")
            .trim_end_matches('/');
        if !id.is_empty() {
            return Some(id.to_string());
        }
    }
    // youtube.com/watch?v={id} (also matches youtube.com/watch?foo=bar&v={id})
    if let Some(q) = url.split_once('?').map(|(_, q)| q)
        && let Some(id) = q
            .split('&')
            .find_map(|p| p.strip_prefix("v=").map(|v| v.to_string()))
    {
        let id = id.split(['#', '/']).next().unwrap_or(&id).to_string();
        if !id.is_empty() {
            return Some(id);
        }
    }
    None
}

// ---------------------------------------------------------------------------
// Player-response parsing
// ---------------------------------------------------------------------------

fn extract_player_response(html: &str) -> Option<Value> {
    // Same regex as webclaw_core::youtube. Duplicated here because
    // core's regex is module-private. Kept in lockstep; changes are
    // rare and we cover with tests in both places.
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE
        .get_or_init(|| Regex::new(r"var\s+ytInitialPlayerResponse\s*=\s*(\{.+?\})\s*;").unwrap());
    let json_str = re.captures(html)?.get(1)?.as_str();
    serde_json::from_str(json_str).ok()
}

// ---------------------------------------------------------------------------
// Meta-tag helpers (for OG fallback)
// ---------------------------------------------------------------------------

fn og(html: &str, prop: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+property="og:([a-z_]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == prop) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn meta_name(html: &str, name: &str) -> Option<String> {
    static RE: OnceLock<Regex> = OnceLock::new();
    let re = RE.get_or_init(|| {
        Regex::new(r#"(?i)<meta[^>]+name="([^"]+)"[^>]+content="([^"]+)""#).unwrap()
    });
    for c in re.captures_iter(html) {
        if c.get(1).is_some_and(|m| m.as_str() == name) {
            return c.get(2).map(|m| m.as_str().to_string());
        }
    }
    None
}

fn get_str(v: Option<&Value>, key: &str) -> Option<String> {
    v.and_then(|x| x.get(key))
        .and_then(|x| x.as_str().map(String::from))
}

fn get_int(v: Option<&Value>, key: &str) -> Option<i64> {
    v.and_then(|x| x.get(key)).and_then(|x| {
        x.as_i64()
            .or_else(|| x.as_str().and_then(|s| s.parse::<i64>().ok()))
    })
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_watch_urls() {
        assert!(matches("https://www.youtube.com/watch?v=dQw4w9WgXcQ"));
        assert!(matches("https://youtu.be/dQw4w9WgXcQ"));
        assert!(matches("https://www.youtube.com/shorts/abc123"));
        assert!(matches(
            "https://www.youtube-nocookie.com/embed/dQw4w9WgXcQ"
        ));
    }

    #[test]
    fn rejects_non_video_urls() {
        assert!(!matches("https://www.youtube.com/"));
        assert!(!matches("https://www.youtube.com/channel/abc"));
        assert!(!matches("https://example.com/watch?v=abc"));
    }

    #[test]
    fn parse_video_id_from_each_shape() {
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://youtu.be/dQw4w9WgXcQ"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://youtu.be/dQw4w9WgXcQ?t=30"),
            Some("dQw4w9WgXcQ".into())
        );
        assert_eq!(
            parse_video_id("https://www.youtube.com/shorts/abc123"),
            Some("abc123".into())
        );
    }

    #[test]
    fn extract_player_response_happy_path() {
        let html = r#"
            <html><body>
            <script>
            var ytInitialPlayerResponse = {"videoDetails":{"videoId":"abc","title":"T","author":"A","viewCount":"100","lengthSeconds":"60","shortDescription":"d"}};
            </script>
            </body></html>
        "#;
        let v = extract_player_response(html).unwrap();
        let vd = v.get("videoDetails").unwrap();
        assert_eq!(vd.get("title").unwrap().as_str(), Some("T"));
    }

    #[test]
    fn og_fallback_extracts_basics_from_meta_tags() {
        let html = r##"
            <html><head>
            <meta property="og:title" content="Example Video Title">
            <meta property="og:description" content="A cool video description.">
            <meta property="og:image" content="https://i.ytimg.com/vi/abc/maxresdefault.jpg">
            <meta name="author" content="Example Channel">
            </head></html>"##;
        let v = build_og_fallback(
            html,
            "https://www.youtube.com/watch?v=abc",
            "https://www.youtube.com/watch?v=abc",
            "abc",
        );
        assert_eq!(v["data_source"], "og_fallback");
        assert_eq!(v["title"], "Example Video Title");
        assert_eq!(v["description"], "A cool video description.");
        assert_eq!(v["author"], "Example Channel");
        assert_eq!(
            v["thumbnails"][0]["url"],
            "https://i.ytimg.com/vi/abc/maxresdefault.jpg"
        );
        assert!(v["view_count"].is_null());
        assert!(v["caption_tracks"].as_array().unwrap().is_empty());
    }
}
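The URL handling above is plain substring splitting, no URL-parsing crate involved. A dependency-free sketch of the same idea (`video_id` is a hypothetical standalone helper, not the crate's API; it covers only the `youtu.be` and `watch?v=` shapes, while the real extractor also handles `/shorts/` and `/embed/`):

```rust
// Minimal sketch of the id parsing used above, stdlib only.
// Hypothetical helper for illustration; not part of webclaw-fetch.
fn video_id(url: &str) -> Option<String> {
    // youtu.be/{id}: take everything up to ?, #, or /.
    if let Some(after) = url.split("youtu.be/").nth(1) {
        let id: String = after.chars().take_while(|c| !"?#/".contains(*c)).collect();
        if !id.is_empty() {
            return Some(id);
        }
    }
    // youtube.com/watch?v={id}: scan the query string for a v= pair,
    // wherever it sits among the other parameters.
    let query = url.split_once('?').map(|(_, q)| q)?;
    query
        .split('&')
        .find_map(|p| p.strip_prefix("v="))
        .map(|v| v.split(|c| c == '#' || c == '/').next().unwrap_or(v).to_string())
        .filter(|id| !id.is_empty())
}

fn main() {
    assert_eq!(
        video_id("https://youtu.be/dQw4w9WgXcQ?t=30").as_deref(),
        Some("dQw4w9WgXcQ")
    );
    assert_eq!(
        video_id("https://www.youtube.com/watch?feature=share&v=dQw4w9WgXcQ").as_deref(),
        Some("dQw4w9WgXcQ")
    );
    assert_eq!(video_id("https://www.youtube.com/"), None);
    println!("ok");
}
```

The design choice this mirrors: YouTube ids are opaque tokens delimited by `?`, `#`, or `/`, so splitting beats full URL parsing and keeps the hot path allocation-light.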
118 crates/webclaw-fetch/src/fetcher.rs Normal file
@@ -0,0 +1,118 @@
//! Pluggable fetcher abstraction for vertical extractors.
//!
//! Extractors call the network through this trait instead of hard-
//! coding [`FetchClient`]. The OSS CLI / MCP / self-hosted server all
//! pass `&FetchClient` (wreq-backed BoringSSL). The production API
//! server, which must not use in-process TLS fingerprinting, provides
//! its own implementation that routes through the Go tls-sidecar.
//!
//! Both paths expose the same [`FetchResult`] shape and the same
//! optional cloud-escalation client, so extractor logic stays
//! identical across environments.
//!
//! ## Choosing an implementation
//!
//! - CLI, MCP, self-hosted `webclaw-server`: build a [`FetchClient`]
//!   with [`FetchClient::with_cloud`] to attach cloud fallback, and
//!   pass it to extractors as `&client`.
//! - `api.webclaw.io` production server: build a `TlsSidecarFetcher`
//!   (in `server/src/engine/`) that delegates to `engine::tls_client`,
//!   and wrap it in `Arc<dyn Fetcher>` for handler injection.
//!
//! ## Why a trait and not a free function
//!
//! Extractors need state beyond a single fetch: the cloud client for
//! antibot escalation, and in the future per-user proxy pools, tenant
//! headers, and circuit breakers. A trait keeps that state encapsulated
//! behind the fetch interface instead of threading it through every
//! extractor signature.

use async_trait::async_trait;

use crate::client::FetchResult;
use crate::cloud::CloudClient;
use crate::error::FetchError;

/// HTTP fetch surface used by vertical extractors.
///
/// Implementations must be `Send + Sync` because extractor dispatchers
/// run them inside tokio tasks, potentially across many requests.
#[async_trait]
pub trait Fetcher: Send + Sync {
    /// Fetch a URL and return the raw response body + metadata. The
    /// body is in `FetchResult::html` regardless of the actual content
    /// type — JSON API endpoints put JSON there, HTML pages put HTML.
    /// Extractors branch on response status and body shape.
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError>;

    /// Fetch with additional request headers. Needed for endpoints
    /// that authenticate via a specific header (Instagram's
    /// `x-ig-app-id`, for example). The default implementation routes
    /// to [`Self::fetch`] so implementers without header support stay
    /// functional, though the extra headers won't be set on the
    /// request.
    async fn fetch_with_headers(
        &self,
        url: &str,
        _headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        self.fetch(url).await
    }

    /// Optional cloud-escalation client for antibot bypass. Returning
    /// `Some` tells extractors they can call into the hosted API when
    /// a local fetch hits a challenge page. Returning `None` makes
    /// cloud-gated extractors emit [`CloudError::NotConfigured`] with
    /// an actionable signup link.
    ///
    /// The default implementation returns `None` because not every
    /// deployment wants cloud fallback (self-hosts that don't have a
    /// webclaw.io subscription, for instance).
    ///
    /// [`CloudError::NotConfigured`]: crate::cloud::CloudError::NotConfigured
    fn cloud(&self) -> Option<&CloudClient> {
        None
    }
}

// ---------------------------------------------------------------------------
// Blanket impls: make `&T` and `Arc<T>` behave like the wrapped `T`.
// ---------------------------------------------------------------------------

#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for &T {
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        (**self).fetch(url).await
    }

    async fn fetch_with_headers(
        &self,
        url: &str,
        headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        (**self).fetch_with_headers(url, headers).await
    }

    fn cloud(&self) -> Option<&CloudClient> {
        (**self).cloud()
    }
}

#[async_trait]
impl<T: Fetcher + ?Sized> Fetcher for std::sync::Arc<T> {
    async fn fetch(&self, url: &str) -> Result<FetchResult, FetchError> {
        (**self).fetch(url).await
    }

    async fn fetch_with_headers(
        &self,
        url: &str,
        headers: &[(&str, &str)],
    ) -> Result<FetchResult, FetchError> {
        (**self).fetch_with_headers(url, headers).await
    }

    fn cloud(&self) -> Option<&CloudClient> {
        (**self).cloud()
    }
}
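The blanket impls above are what let one extractor signature accept a borrowed client, a shared `Arc<dyn Fetcher>`, or a test double interchangeably. A dependency-free sketch of the pattern, with a toy synchronous `Fetch` trait standing in for the async `Fetcher` (hypothetical names; the real trait needs `async_trait`):

```rust
use std::sync::Arc;

// Toy stand-in for `Fetcher`: same blanket-impl trick, minus async.
trait Fetch: Send + Sync {
    fn fetch(&self, url: &str) -> String;
}

// `&T` delegates to the wrapped T, so call sites can pass borrows...
impl<T: Fetch + ?Sized> Fetch for &T {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

// ...and `Arc<T>` delegates too, so shared handles work unchanged.
impl<T: Fetch + ?Sized> Fetch for Arc<T> {
    fn fetch(&self, url: &str) -> String {
        (**self).fetch(url)
    }
}

struct Stub;
impl Fetch for Stub {
    fn fetch(&self, url: &str) -> String {
        format!("stub:{url}")
    }
}

// One generic "extractor" works with every ownership shape.
fn extract<F: Fetch>(client: F, url: &str) -> String {
    client.fetch(url)
}

fn main() {
    let stub = Stub;
    assert_eq!(extract(&stub, "a"), "stub:a"); // borrow
    let shared: Arc<dyn Fetch> = Arc::new(Stub);
    assert_eq!(extract(shared.clone(), "b"), "stub:b"); // Arc<dyn ...>
    assert_eq!(extract(&*shared, "c"), "stub:c"); // &dyn
    println!("ok");
}
```

The `?Sized` bound is what makes the `Arc<dyn Fetch>` case compile: it lets the blanket impls cover trait objects, not just concrete types.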
@@ -3,10 +3,14 @@
//! Automatically detects PDF responses and delegates to webclaw-pdf.
pub mod browser;
pub mod client;
pub mod cloud;
pub mod crawler;
pub mod document;
pub mod error;
pub mod extractors;
pub mod fetcher;
pub mod linkedin;
pub mod locale;
pub mod proxy;
pub mod reddit;
pub mod sitemap;

@@ -16,7 +20,9 @@ pub use browser::BrowserProfile;
pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
pub use error::FetchError;
+pub use fetcher::Fetcher;
pub use http::HeaderMap;
+pub use locale::{accept_language_for_tld, accept_language_for_url};
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;
77 crates/webclaw-fetch/src/locale.rs Normal file
@@ -0,0 +1,77 @@
//! Derive an `Accept-Language` header from a URL.
//!
//! DataDome-class bot detection on country-specific sites (e.g. immobiliare.it,
//! leboncoin.fr) does a geo-vs-locale sanity check: a residential IP in the
//! target country + a browser UA but the wrong `Accept-Language` is a bot
//! signal. Matching the site's expected locale gets us through.
//!
//! Default for unmapped TLDs is `en-US,en;q=0.9` — the global fallback.

/// Best-effort `Accept-Language` header value for the given URL's TLD.
/// Returns `None` if the URL cannot be parsed.
pub fn accept_language_for_url(url: &str) -> Option<&'static str> {
    let host = url::Url::parse(url).ok()?.host_str()?.to_ascii_lowercase();
    let tld = host.rsplit('.').next()?;
    Some(accept_language_for_tld(tld))
}

/// Map a bare TLD like `it`, `fr`, `de` to a plausible `Accept-Language`.
/// Unknown TLDs fall back to US English.
pub fn accept_language_for_tld(tld: &str) -> &'static str {
    match tld {
        "it" => "it-IT,it;q=0.9",
        "fr" => "fr-FR,fr;q=0.9",
        "de" | "at" => "de-DE,de;q=0.9",
        "es" => "es-ES,es;q=0.9",
        "pt" => "pt-PT,pt;q=0.9",
        "nl" => "nl-NL,nl;q=0.9",
        "pl" => "pl-PL,pl;q=0.9",
        "se" => "sv-SE,sv;q=0.9",
        "no" => "nb-NO,nb;q=0.9",
        "dk" => "da-DK,da;q=0.9",
        "fi" => "fi-FI,fi;q=0.9",
        "cz" => "cs-CZ,cs;q=0.9",
        "ro" => "ro-RO,ro;q=0.9",
        "gr" => "el-GR,el;q=0.9",
        "tr" => "tr-TR,tr;q=0.9",
        "ru" => "ru-RU,ru;q=0.9",
        "jp" => "ja-JP,ja;q=0.9",
        "kr" => "ko-KR,ko;q=0.9",
        "cn" => "zh-CN,zh;q=0.9",
        "tw" | "hk" => "zh-TW,zh;q=0.9",
        "br" => "pt-BR,pt;q=0.9",
        "mx" | "ar" | "co" | "cl" | "pe" => "es-ES,es;q=0.9",
        "uk" | "ie" => "en-GB,en;q=0.9",
        _ => "en-US,en;q=0.9",
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn tld_dispatch() {
        assert_eq!(
            accept_language_for_url("https://www.immobiliare.it/annunci/1"),
            Some("it-IT,it;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://www.leboncoin.fr/"),
            Some("fr-FR,fr;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://www.amazon.co.uk/"),
            Some("en-GB,en;q=0.9")
        );
        assert_eq!(
            accept_language_for_url("https://example.com/"),
            Some("en-US,en;q=0.9")
        );
    }

    #[test]
    fn bad_url_returns_none() {
        assert_eq!(accept_language_for_url("not-a-url"), None);
    }
}
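The dispatch above leans on the `url` crate for host parsing, but the core trick is just "last dot-separated label of the host". A stdlib-only sketch of that step (`tld_of` is a hypothetical helper, simplified — no scheme validation or IDN handling):

```rust
// Pull the final dot-separated label out of a URL's host.
// Hypothetical helper for illustration; the real code uses url::Url.
fn tld_of(url: &str) -> Option<String> {
    let rest = url.split("://").nth(1)?;            // require a scheme
    let host = rest.split(['/', '?', '#']).next()?; // authority part
    let host = host.rsplit('@').next()?;            // drop userinfo if present
    let host = host.split(':').next()?;             // drop port
    let tld = host.rsplit('.').next()?;
    if tld.is_empty() {
        None
    } else {
        Some(tld.to_ascii_lowercase())
    }
}

fn main() {
    assert_eq!(tld_of("https://www.immobiliare.it/annunci/1").as_deref(), Some("it"));
    // co.uk collapses to "uk", matching the crate's "uk" => en-GB arm.
    assert_eq!(tld_of("https://www.amazon.co.uk/").as_deref(), Some("uk"));
    assert_eq!(tld_of("not-a-url"), None);
    println!("ok");
}
```

Note the co.uk behavior: taking only the last label is deliberate, since the match arms above key on bare country-code TLDs, not registrable domains.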
@@ -7,10 +7,15 @@

use std::time::Duration;

+use std::borrow::Cow;
+
use wreq::http2::{
    Http2Options, PseudoId, PseudoOrder, SettingId, SettingsOrder, StreamDependency, StreamId,
};
-use wreq::tls::{AlpsProtocol, CertificateCompressionAlgorithm, TlsOptions, TlsVersion};
+use wreq::tls::{
+    AlpnProtocol, AlpsProtocol, CertificateCompressionAlgorithm, ExtensionType, TlsOptions,
+    TlsVersion,
+};
use wreq::{Client, Emulation};

use crate::browser::BrowserVariant;
@@ -43,6 +48,55 @@ const SAFARI_SIGALGS: &str = "ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkc
/// Safari curves.
const SAFARI_CURVES: &str = "X25519:P-256:P-384:P-521";

/// Safari iOS 26 TLS extension order, matching bogdanfinn's
/// `safari_ios_26_0` wire format. GREASE slots are omitted; wreq
/// inserts them itself. Diverges from wreq-util's default SafariIos26
/// extension order, which DataDome's immobiliare.it ruleset flags.
fn safari_ios_extensions() -> Vec<ExtensionType> {
    vec![
        ExtensionType::CERTIFICATE_TIMESTAMP,
        ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION,
        ExtensionType::SERVER_NAME,
        ExtensionType::CERT_COMPRESSION,
        ExtensionType::KEY_SHARE,
        ExtensionType::SUPPORTED_VERSIONS,
        ExtensionType::PSK_KEY_EXCHANGE_MODES,
        ExtensionType::SUPPORTED_GROUPS,
        ExtensionType::RENEGOTIATE,
        ExtensionType::SIGNATURE_ALGORITHMS,
        ExtensionType::STATUS_REQUEST,
        ExtensionType::EC_POINT_FORMATS,
        ExtensionType::EXTENDED_MASTER_SECRET,
    ]
}

/// Chrome 133 TLS extension order, matching bogdanfinn's stable JA3
/// (`43067709b025da334de1279a120f8e14`). Real Chrome permutes extensions
/// per handshake, but indeed.com's WAF allowlists this specific wire order
/// and rejects permuted ones. GREASE slots are inserted by wreq.
///
/// JA3 extension field from peet.ws: 18-5-35-51-10-45-11-27-17613-43-13-0-16-65037-65281-23
fn chrome_extensions() -> Vec<ExtensionType> {
    vec![
        ExtensionType::CERTIFICATE_TIMESTAMP,                  // 18
        ExtensionType::STATUS_REQUEST,                         // 5
        ExtensionType::SESSION_TICKET,                         // 35
        ExtensionType::KEY_SHARE,                              // 51
        ExtensionType::SUPPORTED_GROUPS,                       // 10
        ExtensionType::PSK_KEY_EXCHANGE_MODES,                 // 45
        ExtensionType::EC_POINT_FORMATS,                       // 11
        ExtensionType::CERT_COMPRESSION,                       // 27
        ExtensionType::APPLICATION_SETTINGS_NEW,               // 17613 (new codepoint, matches alps_use_new_codepoint)
        ExtensionType::SUPPORTED_VERSIONS,                     // 43
        ExtensionType::SIGNATURE_ALGORITHMS,                   // 13
        ExtensionType::SERVER_NAME,                            // 0
        ExtensionType::APPLICATION_LAYER_PROTOCOL_NEGOTIATION, // 16
        ExtensionType::ENCRYPTED_CLIENT_HELLO,                 // 65037
        ExtensionType::RENEGOTIATE,                            // 65281
        ExtensionType::EXTENDED_MASTER_SECRET,                 // 23
    ]
}

// --- Chrome HTTP headers in correct wire order ---

const CHROME_HEADERS: &[(&str, &str)] = &[
@@ -130,6 +184,26 @@ const SAFARI_HEADERS: &[(&str, &str)] = &[
    ("sec-fetch-dest", "document"),
];

/// Safari iOS 26 headers, in the wire order real Safari emits. Critically:
/// NO `sec-fetch-*`, NO `priority: u=0, i` (both Chromium-only leaks), but
/// `upgrade-insecure-requests: 1` is present. `accept-encoding` does not
/// include zstd (Safari can't decode it). Verified against bogdanfinn on
/// 2026-04-22: this header set is what DataDome's immobiliare ruleset
/// expects for a real iPhone.
const SAFARI_IOS_HEADERS: &[(&str, &str)] = &[
    (
        "accept",
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    ),
    ("accept-language", "en-US,en;q=0.9"),
    ("accept-encoding", "gzip, deflate, br"),
    (
        "user-agent",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 26_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1",
    ),
    ("upgrade-insecure-requests", "1"),
];

const EDGE_HEADERS: &[(&str, &str)] = &[
    (
        "sec-ch-ua",
@@ -156,6 +230,9 @@ const EDGE_HEADERS: &[(&str, &str)] = &[
];

fn chrome_tls() -> TlsOptions {
    // permute_extensions is off so the explicit extension_permutation sticks.
    // Real Chrome permutes, but indeed.com's WAF allowlists bogdanfinn's
    // fixed order, so matching that gets us through.
    TlsOptions::builder()
        .cipher_list(CHROME_CIPHERS)
        .sigalgs_list(CHROME_SIGALGS)
@@ -163,12 +240,18 @@ fn chrome_tls() -> TlsOptions {
        .min_tls_version(TlsVersion::TLS_1_2)
        .max_tls_version(TlsVersion::TLS_1_3)
        .grease_enabled(true)
-        .permute_extensions(true)
+        .permute_extensions(false)
+        .extension_permutation(chrome_extensions())
        .enable_ech_grease(true)
        .pre_shared_key(true)
        .enable_ocsp_stapling(true)
        .enable_signed_cert_timestamps(true)
-        .alps_protocols([AlpsProtocol::HTTP2])
+        .alpn_protocols([
+            AlpnProtocol::HTTP3,
+            AlpnProtocol::HTTP2,
+            AlpnProtocol::HTTP1,
+        ])
+        .alps_protocols([AlpsProtocol::HTTP3, AlpsProtocol::HTTP2])
        .alps_use_new_codepoint(true)
        .aes_hw_override(true)
        .certificate_compression_algorithms(&[CertificateCompressionAlgorithm::BROTLI])
@@ -212,25 +295,70 @@ fn safari_tls() -> TlsOptions {
        .build()
}

/// Safari iOS 26 emulation — composed on top of `wreq_util::Emulation::SafariIos26`
/// with four targeted overrides. We don't hand-roll this one like Chrome/Firefox
/// because the wire-level defaults from wreq-util are already correct for ciphers,
/// sigalgs, curves, and GREASE — the four things wreq-util gets *wrong* for
/// DataDome compatibility are overridden here:
///
/// 1. TLS extension order: match bogdanfinn `safari_ios_26_0` exactly (the JA3
///    ends up `8d909525bd5bbb79f133d11cc05159fe`).
/// 2. HTTP/2 HEADERS priority flag: weight=256, exclusive=1, depends_on=0.
///    wreq-util omits this frame; real Safari and bogdanfinn include it.
///    This flip is the thing DataDome actually reads — the akamai_fingerprint
///    hash changes from `c52879e43202aeb92740be6e8c86ea96` to
///    `d1294410a06522e37a5c5e3f0a45a705`, which is the winning signature.
/// 3. Headers: strip wreq-util's Chromium defaults (`sec-fetch-*`,
///    `priority: u=0, i`, zstd) and replace them with the real iOS 26 set.
/// 4. `accept-language` is preserved from config.extra_headers for locale.
fn safari_ios_emulation() -> wreq::Emulation {
    use wreq::EmulationFactory;
    let mut em = wreq_util::Emulation::SafariIos26.emulation();

    if let Some(tls) = em.tls_options_mut().as_mut() {
        tls.extension_permutation = Some(Cow::Owned(safari_ios_extensions()));
    }

    // Only override the priority flag — keep wreq-util's SETTINGS, WINDOW_UPDATE,
    // and pseudo-order intact. Replacing the whole Http2Options resets SETTINGS
    // to defaults, which sends only INITIAL_WINDOW_SIZE and fails DataDome.
    if let Some(h2) = em.http2_options_mut().as_mut() {
        h2.headers_stream_dependency = Some(StreamDependency::new(StreamId::zero(), 255, true));
    }

    let hm = em.headers_mut();
    hm.clear();
    for (k, v) in SAFARI_IOS_HEADERS {
        if let (Ok(n), Ok(val)) = (
            http::header::HeaderName::from_bytes(k.as_bytes()),
            http::header::HeaderValue::from_str(v),
        ) {
            hm.append(n, val);
        }
    }

    em
}

fn chrome_h2() -> Http2Options {
    // SETTINGS frame matches bogdanfinn `chrome_133`: HEADER_TABLE_SIZE,
    // ENABLE_PUSH=0, INITIAL_WINDOW_SIZE, MAX_HEADER_LIST_SIZE. No
    // MAX_CONCURRENT_STREAMS — real Chrome 133 and bogdanfinn both omit it,
    // and indeed.com's WAF reads this as a bot signal when present. Priority
    // weight 256 (encoded as 255 + 1) matches bogdanfinn's HEADERS frame.
    Http2Options::builder()
        .initial_window_size(6_291_456)
        .initial_connection_window_size(15_728_640)
        .max_header_list_size(262_144)
        .header_table_size(65_536)
        .max_concurrent_streams(1000u32)
        .enable_push(false)
        .settings_order(
            SettingsOrder::builder()
                .extend([
                    SettingId::HeaderTableSize,
                    SettingId::EnablePush,
                    SettingId::MaxConcurrentStreams,
                    SettingId::InitialWindowSize,
                    SettingId::MaxFrameSize,
                    SettingId::MaxHeaderListSize,
                    SettingId::EnableConnectProtocol,
                    SettingId::NoRfc7540Priorities,
                ])
                .build(),
        )
@@ -244,7 +372,7 @@ fn chrome_h2() -> Http2Options {
                ])
                .build(),
        )
-        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 219, true))
+        .headers_stream_dependency(StreamDependency::new(StreamId::zero(), 255, true))
        .build()
}
@@ -328,32 +456,38 @@ pub fn build_client(
    extra_headers: &std::collections::HashMap<String, String>,
    proxy: Option<&str>,
) -> Result<Client, FetchError> {
-    let (tls, h2, headers) = match variant {
-        BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
-        BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
-        BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
-        BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
-        BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+    // SafariIos26 builds its Emulation on top of wreq-util's base instead
+    // of from scratch. See `safari_ios_emulation` for why.
+    let mut emulation = match variant {
+        BrowserVariant::SafariIos26 => safari_ios_emulation(),
+        other => {
+            let (tls, h2, headers) = match other {
+                BrowserVariant::Chrome => (chrome_tls(), chrome_h2(), CHROME_HEADERS),
+                BrowserVariant::ChromeMacos => (chrome_tls(), chrome_h2(), CHROME_MACOS_HEADERS),
+                BrowserVariant::Firefox => (firefox_tls(), firefox_h2(), FIREFOX_HEADERS),
+                BrowserVariant::Safari => (safari_tls(), safari_h2(), SAFARI_HEADERS),
+                BrowserVariant::Edge => (chrome_tls(), chrome_h2(), EDGE_HEADERS),
+                BrowserVariant::SafariIos26 => unreachable!("handled above"),
+            };
+            Emulation::builder()
+                .tls_options(tls)
+                .http2_options(h2)
+                .headers(build_headers(headers))
+                .build()
+        }
    };

-    let mut header_map = build_headers(headers);
-
-    // Append extra headers after profile defaults
+    // Append extra headers after profile defaults.
+    let hm = emulation.headers_mut();
    for (k, v) in extra_headers {
        if let (Ok(n), Ok(val)) = (
            http::header::HeaderName::from_bytes(k.as_bytes()),
            http::header::HeaderValue::from_str(v),
        ) {
-            header_map.insert(n, val);
+            hm.insert(n, val);
        }
    }

-    let emulation = Emulation::builder()
-        .tls_options(tls)
-        .http2_options(h2)
-        .headers(header_map)
-        .build();
-
    let mut builder = Client::builder()
        .emulation(emulation)
        .redirect(wreq::redirect::Policy::limited(10))
@@ -22,6 +22,5 @@ serde_json = { workspace = true }
 tokio = { workspace = true }
 tracing = { workspace = true }
 tracing-subscriber = { workspace = true }
-reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
 url = "2"
 dirs = "6.0.0"
@@ -1,302 +0,0 @@
/// Cloud API fallback for protected sites.
///
/// When local fetch returns a challenge page, this module retries
/// via api.webclaw.io. Requires WEBCLAW_API_KEY to be set.
use std::time::Duration;

use serde_json::{Value, json};
use tracing::info;

const API_BASE: &str = "https://api.webclaw.io/v1";

/// Lightweight client for the webclaw cloud API.
pub struct CloudClient {
    api_key: String,
    http: reqwest::Client,
}

impl CloudClient {
    /// Create a new cloud client from WEBCLAW_API_KEY env var.
    /// Returns None if the key is not set.
    pub fn from_env() -> Option<Self> {
        let key = std::env::var("WEBCLAW_API_KEY").ok()?;
        if key.is_empty() {
            return None;
        }
        let http = reqwest::Client::builder()
            .timeout(Duration::from_secs(60))
            .build()
            .unwrap_or_default();
        Some(Self { api_key: key, http })
    }

    /// Scrape a URL via the cloud API. Returns the response JSON.
    pub async fn scrape(
        &self,
        url: &str,
        formats: &[&str],
        include_selectors: &[String],
        exclude_selectors: &[String],
        only_main_content: bool,
    ) -> Result<Value, String> {
        let mut body = json!({
            "url": url,
            "formats": formats,
        });

        if only_main_content {
            body["only_main_content"] = json!(true);
        }
        if !include_selectors.is_empty() {
            body["include_selectors"] = json!(include_selectors);
        }
        if !exclude_selectors.is_empty() {
            body["exclude_selectors"] = json!(exclude_selectors);
        }

        self.post("scrape", body).await
    }

    /// Generic POST to the cloud API.
    pub async fn post(&self, endpoint: &str, body: Value) -> Result<Value, String> {
        let resp = self
            .http
            .post(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&body)
            .send()
            .await
            .map_err(|e| format!("Cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            let truncated = truncate_error(&text);
            return Err(format!("Cloud API error {status}: {truncated}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("Cloud API response parse failed: {e}"))
    }

    /// Generic GET from the cloud API.
    pub async fn get(&self, endpoint: &str) -> Result<Value, String> {
        let resp = self
            .http
            .get(format!("{API_BASE}/{endpoint}"))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map_err(|e| format!("Cloud API request failed: {e}"))?;

        let status = resp.status();
        if !status.is_success() {
            let text = resp.text().await.unwrap_or_default();
            let truncated = truncate_error(&text);
            return Err(format!("Cloud API error {status}: {truncated}"));
        }

        resp.json::<Value>()
            .await
            .map_err(|e| format!("Cloud API response parse failed: {e}"))
    }
}

/// Truncate error body to avoid flooding logs with huge HTML responses.
fn truncate_error(text: &str) -> &str {
    const MAX_LEN: usize = 500;
    match text.char_indices().nth(MAX_LEN) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text,
    }
}

/// Check if fetched HTML looks like a bot protection challenge page.
/// Detects common bot protection challenge pages.
pub fn is_bot_protected(html: &str, headers: &webclaw_fetch::HeaderMap) -> bool {
    let html_lower = html.to_lowercase();

    // Cloudflare challenge page
    if html_lower.contains("_cf_chl_opt") || html_lower.contains("challenge-platform") {
        return true;
    }

    // Cloudflare "checking your browser" spinner
    if (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
        && html_lower.contains("cf-spinner")
    {
        return true;
    }

    // Cloudflare Turnstile (only on short pages = challenge, not embedded on real content)
    if (html_lower.contains("cf-turnstile")
        || html_lower.contains("challenges.cloudflare.com/turnstile"))
        && html.len() < 100_000
    {
        return true;
    }

    // DataDome
    if html_lower.contains("geo.captcha-delivery.com")
        || html_lower.contains("captcha-delivery.com/captcha")
    {
        return true;
    }

    // AWS WAF
    if html_lower.contains("awswaf-captcha") || html_lower.contains("aws-waf-client-browser") {
        return true;
    }

    // hCaptcha blocking page
    if html_lower.contains("hcaptcha.com")
        && html_lower.contains("h-captcha")
        && html.len() < 50_000
    {
        return true;
    }

    // Cloudflare via headers + challenge body
    let has_cf_headers = headers.get("cf-ray").is_some() || headers.get("cf-mitigated").is_some();
    if has_cf_headers
        && (html_lower.contains("just a moment") || html_lower.contains("checking your browser"))
    {
        return true;
    }

    false
}

/// Check if a page likely needs JS rendering (SPA with almost no text content).
pub fn needs_js_rendering(word_count: usize, html: &str) -> bool {
    let has_scripts = html.contains("<script");

    // Tier 1: almost no extractable text from a large page
    if word_count < 50 && html.len() > 5_000 && has_scripts {
        return true;
    }

    // Tier 2: SPA framework detected with suspiciously low content-to-HTML ratio
    if word_count < 800 && html.len() > 50_000 && has_scripts {
        let html_lower = html.to_lowercase();
        let has_spa_marker = html_lower.contains("react-app")
            || html_lower.contains("id=\"__next\"")
            || html_lower.contains("id=\"root\"")
            || html_lower.contains("id=\"app\"")
            || html_lower.contains("__next_data__")
            || html_lower.contains("nuxt")
            || html_lower.contains("ng-app");

        if has_spa_marker {
            return true;
        }
    }

    false
}

/// Result of a smart fetch: either local extraction or cloud API response.
pub enum SmartFetchResult {
    /// Successfully extracted locally.
    Local(Box<webclaw_core::ExtractionResult>),
    /// Fell back to cloud API. Contains the API response JSON.
    Cloud(Value),
}

/// Try local fetch first, fall back to cloud API if bot-protected or JS-rendered.
///
/// Returns the extraction result (local) or the cloud API response JSON.
/// If no API key is configured and local fetch is blocked, returns an error
/// with a helpful message.
pub async fn smart_fetch(
    client: &webclaw_fetch::FetchClient,
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    // Step 1: Try local fetch (with timeout to avoid hanging on slow servers)
    let fetch_result = tokio::time::timeout(Duration::from_secs(30), client.fetch(url))
        .await
        .map_err(|_| format!("Fetch timed out after 30s for {url}"))?
        .map_err(|e| format!("Fetch failed: {e}"))?;

    // Step 2: Check for bot protection
    if is_bot_protected(&fetch_result.html, &fetch_result.headers) {
        info!(url, "bot protection detected, falling back to cloud API");
        return cloud_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    // Step 3: Extract locally
    let options = webclaw_core::ExtractionOptions {
        include_selectors: include_selectors.to_vec(),
        exclude_selectors: exclude_selectors.to_vec(),
        only_main_content,
        include_raw_html: false,
    };

    let extraction =
        webclaw_core::extract_with_options(&fetch_result.html, Some(&fetch_result.url), &options)
            .map_err(|e| format!("Extraction failed: {e}"))?;

    // Step 4: Check for JS-rendered pages (low content from large HTML)
    if needs_js_rendering(extraction.metadata.word_count, &fetch_result.html) {
        info!(
            url,
            word_count = extraction.metadata.word_count,
            html_len = fetch_result.html.len(),
            "JS-rendered page detected, falling back to cloud API"
        );
        return cloud_fallback(
            cloud,
            url,
            include_selectors,
            exclude_selectors,
            only_main_content,
            formats,
        )
        .await;
    }

    Ok(SmartFetchResult::Local(Box::new(extraction)))
}

async fn cloud_fallback(
    cloud: Option<&CloudClient>,
    url: &str,
    include_selectors: &[String],
    exclude_selectors: &[String],
    only_main_content: bool,
    formats: &[&str],
) -> Result<SmartFetchResult, String> {
    match cloud {
        Some(c) => {
            let resp = c
                .scrape(
                    url,
                    formats,
                    include_selectors,
                    exclude_selectors,
                    only_main_content,
                )
                .await?;
            info!(url, "cloud API fallback successful");
            Ok(SmartFetchResult::Cloud(resp))
        }
        None => Err(format!(
            "Bot protection detected on {url}. Set WEBCLAW_API_KEY for automatic cloud bypass. \
             Get a key at https://webclaw.io"
        )),
    }
}
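One detail worth calling out in the deleted `cloud.rs` above is `truncate_error`: it truncates by character count via `char_indices().nth()`, not by byte index. A standalone sketch (function name `truncate` and the 2-char limit are mine, for illustration) of why that matters:

```rust
// Char-boundary-safe truncation, mirroring the `truncate_error` helper
// above: `char_indices().nth(n)` yields the byte offset of the (n+1)-th
// character, so the slice can never split a multi-byte UTF-8 codepoint —
// whereas a naive `&text[..500]` would panic on a non-boundary byte.
fn truncate(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_pos, _)) => &text[..byte_pos],
        None => text, // already short enough
    }
}

fn main() {
    // "héllo" is 5 chars but 6 bytes; `&s[..2]` would panic mid-'é'.
    assert_eq!(truncate("héllo", 2), "hé");
    assert_eq!(truncate("short", 500), "short");
    println!("ok");
}
```

This is exactly the failure mode you hit when an upstream error body is a Cloudflare HTML page full of non-ASCII text.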
@@ -1,7 +1,6 @@
 /// webclaw-mcp: MCP (Model Context Protocol) server for webclaw.
 /// Exposes web extraction tools over stdio transport for AI agents
 /// like Claude Desktop, Claude Code, and other MCP clients.
-mod cloud;
 mod server;
 mod tools;
@@ -15,10 +15,17 @@ use serde_json::json;
 use tracing::{error, info, warn};
 use url::Url;

-use crate::cloud::{self, CloudClient, SmartFetchResult};
+use webclaw_fetch::cloud::{self, CloudClient, SmartFetchResult};

 use crate::tools::*;

 pub struct WebclawMcp {
+    /// Holds the registered MCP tools. `rmcp >= 1.3` reads this through a
+    /// derived trait impl (not by name), so rustc's dead-code lint can't
+    /// see the usage and fires a spurious `field tool_router is never
+    /// read` on `cargo install`. The field is essential — dropping it
+    /// would unregister every tool. See issue #30.
+    #[allow(dead_code)]
     tool_router: ToolRouter<Self>,
     fetch_client: Arc<webclaw_fetch::FetchClient>,
     /// Lazily-initialized Firefox client, reused across all tool calls that
@@ -711,6 +718,55 @@ impl WebclawMcp {
         Ok(serde_json::to_string_pretty(&resp).unwrap_or_default())
     }
     }
+
+    /// List every vertical extractor the server knows about. Returns a
+    /// JSON array of `{name, label, description, url_patterns}` entries.
+    /// Call this to discover what verticals are available before using
+    /// `vertical_scrape`.
+    #[tool]
+    async fn list_extractors(
+        &self,
+        Parameters(_params): Parameters<ListExtractorsParams>,
+    ) -> Result<String, String> {
+        let catalog = webclaw_fetch::extractors::list();
+        serde_json::to_string_pretty(&catalog)
+            .map_err(|e| format!("failed to serialise extractor catalog: {e}"))
+    }
+
+    /// Run a vertical extractor by name and return typed JSON specific
+    /// to the target site (title, price, rating, author, etc.), not
+    /// generic markdown. Use `list_extractors` to discover available
+    /// names. Example names: `reddit`, `github_repo`, `trustpilot_reviews`,
+    /// `youtube_video`, `shopify_product`, `pypi`, `npm`, `arxiv`.
+    ///
+    /// Antibot-gated verticals (amazon_product, ebay_listing,
+    /// etsy_listing, trustpilot_reviews) will automatically escalate to
+    /// the webclaw cloud API when local fetch hits bot protection,
+    /// provided `WEBCLAW_API_KEY` is set.
+    #[tool]
+    async fn vertical_scrape(
+        &self,
+        Parameters(params): Parameters<VerticalParams>,
+    ) -> Result<String, String> {
+        validate_url(&params.url)?;
+        // Use the cached Firefox client, not the default Chrome one.
+        // Reddit's `.json` endpoint rejects the wreq-Chrome TLS
+        // fingerprint with a 403 even from residential IPs (they
+        // ship a fingerprint blocklist that includes common
+        // browser-emulation libraries). The wreq-Firefox fingerprint
+        // still passes, and Firefox is equally fine for every other
+        // vertical in the catalog, so it's a strictly-safer default
+        // for `vertical_scrape` than the generic `scrape` tool's
+        // Chrome default. Matches the CLI `webclaw vertical`
+        // subcommand which already uses Firefox.
+        let client = self.firefox_or_build()?;
+        let data =
+            webclaw_fetch::extractors::dispatch_by_name(client.as_ref(), &params.name, &params.url)
+                .await
+                .map_err(|e| e.to_string())?;
+        serde_json::to_string_pretty(&data)
+            .map_err(|e| format!("failed to serialise extractor output: {e}"))
+    }
 }

 #[tool_handler]
@@ -720,7 +776,8 @@ impl ServerHandler for WebclawMcp {
         .with_server_info(Implementation::new("webclaw-mcp", env!("CARGO_PKG_VERSION")))
         .with_instructions(String::from(
             "Webclaw MCP server -- web content extraction for AI agents. \
-             Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search.",
+             Tools: scrape, crawl, map, batch, extract, summarize, diff, brand, research, search, \
+             list_extractors, vertical_scrape.",
         ))
     }
 }
@@ -103,3 +103,20 @@ pub struct SearchParams {
     /// Number of results to return (default: 10)
     pub num_results: Option<u32>,
 }
+
+/// Parameters for `vertical_scrape`: run a site-specific extractor by name.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct VerticalParams {
+    /// Name of the vertical extractor. Call `list_extractors` to see all
+    /// available names. Examples: "reddit", "github_repo", "pypi",
+    /// "trustpilot_reviews", "youtube_video", "shopify_product".
+    pub name: String,
+    /// URL to extract. Must match the URL patterns the extractor claims;
+    /// otherwise the tool returns a clear "URL mismatch" error.
+    pub url: String,
+}
+
+/// `list_extractors` takes no arguments but we still need an empty struct
+/// so rmcp can generate a schema and parse the (empty) JSON-RPC params.
+#[derive(Debug, Deserialize, JsonSchema)]
+pub struct ListExtractorsParams {}
29  crates/webclaw-server/Cargo.toml  Normal file
@@ -0,0 +1,29 @@
[package]
name = "webclaw-server"
version.workspace = true
edition.workspace = true
license.workspace = true
repository.workspace = true
description = "Minimal REST API server for self-hosting webclaw extraction. Wraps the OSS extraction crates with HTTP endpoints. NOT the production hosted API at api.webclaw.io — this is a stateless, single-binary reference server for local + self-hosted deployments."

[[bin]]
name = "webclaw-server"
path = "src/main.rs"

[dependencies]
webclaw-core = { workspace = true }
webclaw-fetch = { workspace = true }
webclaw-llm = { workspace = true }
webclaw-pdf = { workspace = true }

axum = { version = "0.8", features = ["macros"] }
tokio = { workspace = true }
tower-http = { version = "0.6", features = ["trace", "cors"] }
clap = { workspace = true, features = ["derive", "env"] }
serde = { workspace = true }
serde_json = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true, features = ["env-filter"] }
anyhow = "1"
thiserror = { workspace = true }
subtle = "2.6"
48  crates/webclaw-server/src/auth.rs  Normal file
@@ -0,0 +1,48 @@
//! Optional bearer-token middleware.
//!
//! When the server is started without `--api-key`, every request is allowed
//! through (server runs in "open" mode — appropriate for `localhost`-only
//! deployments). When a key is configured, every `/v1/*` request must
//! present `Authorization: Bearer <key>` and the comparison is constant-
//! time to avoid timing-leaking the key.

use axum::{
    extract::{Request, State},
    http::StatusCode,
    middleware::Next,
    response::Response,
};
use subtle::ConstantTimeEq;

use crate::state::AppState;

/// Axum middleware. Mount with `axum::middleware::from_fn_with_state`.
pub async fn require_bearer(
    State(state): State<AppState>,
    request: Request,
    next: Next,
) -> Result<Response, StatusCode> {
    let Some(expected) = state.api_key() else {
        // Open mode — no key configured. Allow everything.
        return Ok(next.run(request).await);
    };

    let Some(header) = request
        .headers()
        .get("authorization")
        .and_then(|v| v.to_str().ok())
    else {
        return Err(StatusCode::UNAUTHORIZED);
    };

    let presented = header
        .strip_prefix("Bearer ")
        .or_else(|| header.strip_prefix("bearer "))
        .ok_or(StatusCode::UNAUTHORIZED)?;

    if presented.as_bytes().ct_eq(expected.as_bytes()).into() {
        Ok(next.run(request).await)
    } else {
        Err(StatusCode::UNAUTHORIZED)
    }
}
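The module comment in `auth.rs` above says the key comparison is constant-time; a hand-rolled, stdlib-only sketch of what `subtle::ConstantTimeEq` provides (for illustration only — the real code should keep using `subtle`, and note that the length check here still short-circuits, which `subtle` sidesteps by operating on equal-length slices):

```rust
// XOR-accumulate over every byte instead of returning at the first
// mismatch, so runtime does not depend on how long a matching prefix
// the attacker guessed.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // lengths are not secret here
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // stays 0 only if every byte pair matches
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"secret-key", b"secret-key"));
    assert!(!ct_eq(b"secret-key", b"secret-keZ"));
    assert!(!ct_eq(b"secret", b"secret-key"));
    println!("ok");
}
```

A naive `presented == expected` on `&str` can bail out at the first differing byte, which is the timing side channel the middleware is guarding against.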
87  crates/webclaw-server/src/error.rs  Normal file
@@ -0,0 +1,87 @@
//! API error type. Maps internal errors to HTTP status codes + JSON.

use axum::{
    Json,
    http::StatusCode,
    response::{IntoResponse, Response},
};
use serde_json::json;
use thiserror::Error;

/// Public-facing API error. Always serializes as `{ "error": "..." }`.
/// Keep messages user-actionable; internal details belong in tracing logs.
///
/// `Unauthorized` / `NotFound` / `Internal` are kept on the enum as
/// stable variants for handlers that don't exist yet (planned: per-key
/// rate-limit responses, dynamic route 404s). Marking them dead-code-OK
/// is preferable to inventing them later in three places.
#[allow(dead_code)]
#[derive(Debug, Error)]
pub enum ApiError {
    #[error("{0}")]
    BadRequest(String),

    #[error("unauthorized")]
    Unauthorized,

    #[error("not found")]
    NotFound,

    #[error("upstream fetch failed: {0}")]
    Fetch(String),

    #[error("extraction failed: {0}")]
    Extract(String),

    #[error("LLM provider error: {0}")]
    Llm(String),

    #[error("internal: {0}")]
    Internal(String),
}

impl ApiError {
    pub fn bad_request(msg: impl Into<String>) -> Self {
        Self::BadRequest(msg.into())
    }
    #[allow(dead_code)]
    pub fn internal(msg: impl Into<String>) -> Self {
        Self::Internal(msg.into())
    }

    fn status(&self) -> StatusCode {
        match self {
            Self::BadRequest(_) => StatusCode::BAD_REQUEST,
            Self::Unauthorized => StatusCode::UNAUTHORIZED,
            Self::NotFound => StatusCode::NOT_FOUND,
            Self::Fetch(_) => StatusCode::BAD_GATEWAY,
            Self::Extract(_) | Self::Llm(_) => StatusCode::UNPROCESSABLE_ENTITY,
            Self::Internal(_) => StatusCode::INTERNAL_SERVER_ERROR,
        }
    }
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        let body = Json(json!({ "error": self.to_string() }));
        (self.status(), body).into_response()
    }
}

impl From<webclaw_fetch::FetchError> for ApiError {
    fn from(e: webclaw_fetch::FetchError) -> Self {
        Self::Fetch(e.to_string())
    }
}

impl From<webclaw_core::ExtractError> for ApiError {
    fn from(e: webclaw_core::ExtractError) -> Self {
        Self::Extract(e.to_string())
    }
}

impl From<webclaw_llm::LlmError> for ApiError {
    fn from(e: webclaw_llm::LlmError) -> Self {
        Self::Llm(e.to_string())
    }
}
123  crates/webclaw-server/src/main.rs  Normal file
@@ -0,0 +1,123 @@
//! webclaw-server — minimal REST API for self-hosting webclaw extraction.
//!
//! This is the OSS reference server. It is intentionally small:
//! single binary, stateless, no database, no job queue. It wraps the
//! same extraction crates the CLI and MCP server use, exposed over
//! HTTP with JSON shapes that mirror the hosted API at
//! api.webclaw.io where the underlying capability exists in OSS.
//!
//! Hosted-only features (anti-bot bypass, JS rendering, async crawl
//! jobs, multi-tenant auth, billing) are *not* implemented here and
//! never will be — they're closed-source. See the docs for the full
//! "what self-hosting gives you vs. what the cloud gives you" matrix.

mod auth;
mod error;
mod routes;
mod state;

use std::net::{IpAddr, SocketAddr};
use std::time::Duration;

use axum::{
    Router,
    middleware::from_fn_with_state,
    routing::{get, post},
};
use clap::Parser;
use tower_http::cors::{Any, CorsLayer};
use tower_http::trace::TraceLayer;
use tracing::info;
use tracing_subscriber::{EnvFilter, fmt};

use crate::state::AppState;

#[derive(Parser, Debug)]
#[command(
    name = "webclaw-server",
    version,
    about = "Minimal self-hosted REST API for webclaw extraction.",
    long_about = "Stateless single-binary REST API. Wraps the OSS extraction \
                  crates over HTTP. For the full hosted platform (anti-bot, \
                  JS render, async jobs, multi-tenant), use api.webclaw.io."
)]
struct Args {
    /// Port to listen on. Env: WEBCLAW_PORT.
    #[arg(short, long, env = "WEBCLAW_PORT", default_value_t = 3000)]
    port: u16,

    /// Host to bind to. Env: WEBCLAW_HOST.
    /// Default `127.0.0.1` keeps the server local-only; set to
    /// `0.0.0.0` to expose on all interfaces (only do this with
    /// `--api-key` set or behind a reverse proxy that adds auth).
    #[arg(long, env = "WEBCLAW_HOST", default_value = "127.0.0.1")]
    host: IpAddr,

    /// Optional bearer token. Env: WEBCLAW_API_KEY. When set, every
    /// `/v1/*` request must present `Authorization: Bearer <key>`.
    /// When unset, the server runs in open mode (no auth) — only
    /// safe on a local-bound interface or behind another auth layer.
    #[arg(long, env = "WEBCLAW_API_KEY")]
    api_key: Option<String>,

    /// Tracing filter. Env: RUST_LOG.
    #[arg(long, env = "RUST_LOG", default_value = "info,webclaw_server=info")]
    log: String,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let args = Args::parse();

    fmt()
        .with_env_filter(EnvFilter::try_new(&args.log).unwrap_or_else(|_| EnvFilter::new("info")))
        .with_target(false)
        .compact()
        .init();

    let state = AppState::new(args.api_key.clone())?;

    let v1 = Router::new()
        .route("/scrape", post(routes::scrape::scrape))
        .route(
            "/scrape/{vertical}",
            post(routes::structured::scrape_vertical),
        )
        .route("/crawl", post(routes::crawl::crawl))
        .route("/map", post(routes::map::map))
        .route("/batch", post(routes::batch::batch))
        .route("/extract", post(routes::extract::extract))
        .route("/extractors", get(routes::structured::list_extractors))
        .route("/summarize", post(routes::summarize::summarize_route))
        .route("/diff", post(routes::diff::diff_route))
        .route("/brand", post(routes::brand::brand))
        .layer(from_fn_with_state(state.clone(), auth::require_bearer));

    let app = Router::new()
        .route("/health", get(routes::health::health))
        .nest("/v1", v1)
        .layer(
            // Permissive CORS — same posture as a self-hosted dev tool.
            // Tighten in front with a reverse proxy if you expose this
            // publicly.
            CorsLayer::new()
                .allow_origin(Any)
                .allow_methods(Any)
                .allow_headers(Any)
                .max_age(Duration::from_secs(3600)),
        )
        .layer(TraceLayer::new_for_http())
        .with_state(state);

    let addr = SocketAddr::from((args.host, args.port));
    let listener = tokio::net::TcpListener::bind(addr).await?;
    let auth_status = if args.api_key.is_some() {
        "bearer auth required"
    } else {
        "open mode (no auth)"
    };
    info!(%addr, mode = auth_status, "webclaw-server listening");

    axum::serve(listener, app).await?;
    Ok(())
}
85  crates/webclaw-server/src/routes/batch.rs  Normal file
@@ -0,0 +1,85 @@
//! POST /v1/batch — fetch + extract many URLs in parallel.
//!
//! `concurrency` is hard-capped at 20 to avoid hammering targets and
//! to bound memory growth for naive callers. For larger batches use
//! the hosted API.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_core::ExtractionOptions;

use crate::{error::ApiError, state::AppState};

const HARD_MAX_URLS: usize = 100;
const HARD_MAX_CONCURRENCY: usize = 20;

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct BatchRequest {
    pub urls: Vec<String>,
    pub concurrency: Option<usize>,
    pub include_selectors: Vec<String>,
    pub exclude_selectors: Vec<String>,
    pub only_main_content: bool,
}

pub async fn batch(
    State(state): State<AppState>,
    Json(req): Json<BatchRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.urls.is_empty() {
        return Err(ApiError::bad_request("`urls` is required"));
    }
    if req.urls.len() > HARD_MAX_URLS {
        return Err(ApiError::bad_request(format!(
            "too many urls: {} (max {HARD_MAX_URLS})",
            req.urls.len()
        )));
    }

    let concurrency = req.concurrency.unwrap_or(5).clamp(1, HARD_MAX_CONCURRENCY);

    let options = ExtractionOptions {
        include_selectors: req.include_selectors,
        exclude_selectors: req.exclude_selectors,
        only_main_content: req.only_main_content,
        include_raw_html: false,
    };

    let url_refs: Vec<&str> = req.urls.iter().map(|s| s.as_str()).collect();
    let results = state
        .fetch()
        .fetch_and_extract_batch_with_options(&url_refs, concurrency, &options)
        .await;

    let mut ok = 0usize;
    let mut errors = 0usize;
    let mut out: Vec<Value> = Vec::with_capacity(results.len());
    for r in results {
        match r.result {
            Ok(extraction) => {
                ok += 1;
                out.push(json!({
                    "url": r.url,
                    "metadata": extraction.metadata,
                    "markdown": extraction.content.markdown,
                }));
            }
            Err(e) => {
                errors += 1;
                out.push(json!({
                    "url": r.url,
                    "error": e.to_string(),
                }));
            }
        }
    }

    Ok(Json(json!({
        "total": out.len(),
        "completed": ok,
        "errors": errors,
        "results": out,
    })))
}
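The batch route normalizes `concurrency` with `unwrap_or(5).clamp(1, HARD_MAX_CONCURRENCY)`. A minimal sketch of that normalization in isolation (helper name is mine), showing why `clamp` rather than `min` is the right call here:

```rust
// Default of 5 when the field is omitted; clamp enforces both a floor
// (a client sending 0 would otherwise spawn zero workers and stall the
// batch) and the hard ceiling of 20 from the route above.
fn effective_concurrency(requested: Option<usize>) -> usize {
    requested.unwrap_or(5).clamp(1, 20)
}

fn main() {
    assert_eq!(effective_concurrency(None), 5);       // default
    assert_eq!(effective_concurrency(Some(0)), 1);    // floor
    assert_eq!(effective_concurrency(Some(7)), 7);    // passthrough
    assert_eq!(effective_concurrency(Some(500)), 20); // ceiling
    println!("ok");
}
```

A bare `.min(20)` (as the crawl route below uses for its limits) would let a zero through, which matters more for batch because the value feeds a worker pool.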
32  crates/webclaw-server/src/routes/brand.rs  Normal file
@@ -0,0 +1,32 @@
//! POST /v1/brand — extract brand identity (colors, fonts, logo) from a page.
//!
//! Pure DOM/CSS analysis — no LLM, no network beyond the page fetch itself.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_core::brand::extract_brand;

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize)]
pub struct BrandRequest {
    pub url: String,
}

pub async fn brand(
    State(state): State<AppState>,
    Json(req): Json<BrandRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }

    let fetched = state.fetch().fetch(&req.url).await?;
    let brand = extract_brand(&fetched.html, Some(&fetched.url));

    Ok(Json(json!({
        "url": req.url,
        "brand": brand,
    })))
}
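For a feel of the kind of DOM/CSS signal `extract_brand` works from, here is a hypothetical, stdlib-only illustration that pulls a `theme-color` meta value out of raw HTML. This is not how webclaw-core implements it (the real extractor does proper parsing and covers colors, fonts, and logos); the function name and the naive string scan are mine:

```rust
// Naive scan for `<meta name="theme-color" content="...">`. Assumes the
// `content` attribute follows the `name` attribute — good enough as an
// illustration, not for production HTML.
fn theme_color(html: &str) -> Option<&str> {
    let lower = html.to_lowercase();
    let meta_at = lower.find("name=\"theme-color\"")?;
    let rest = &html[meta_at..];
    let rest_lower = &lower[meta_at..];
    let start = rest_lower.find("content=\"")? + "content=\"".len();
    let end = rest[start..].find('"')? + start;
    Some(&rest[start..end])
}

fn main() {
    let html = r##"<head><meta name="theme-color" content="#0f172a"></head>"##;
    assert_eq!(theme_color(html), Some("#0f172a"));
    assert_eq!(theme_color("<p>no meta</p>"), None);
    println!("ok");
}
```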
85
crates/webclaw-server/src/routes/crawl.rs
Normal file
85
crates/webclaw-server/src/routes/crawl.rs
Normal file
|
|
@ -0,0 +1,85 @@
|
|||
//! POST /v1/crawl — synchronous BFS crawl.
//!
//! NOTE: this server is stateless — there is no job queue. Crawls run
//! inline and return when complete. `max_pages` is hard-capped at 500
//! to avoid OOM on naive callers. For large crawls + async jobs, use
//! the hosted API at api.webclaw.io.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use std::time::Duration;
use webclaw_fetch::{CrawlConfig, Crawler, FetchConfig};

use crate::{error::ApiError, state::AppState};

const HARD_MAX_PAGES: usize = 500;

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct CrawlRequest {
    pub url: String,
    pub max_depth: Option<usize>,
    pub max_pages: Option<usize>,
    pub use_sitemap: bool,
    pub concurrency: Option<usize>,
    pub allow_subdomains: bool,
    pub allow_external_links: bool,
    pub include_patterns: Vec<String>,
    pub exclude_patterns: Vec<String>,
}

pub async fn crawl(
    State(_state): State<AppState>,
    Json(req): Json<CrawlRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }
    let max_pages = req.max_pages.unwrap_or(50).min(HARD_MAX_PAGES);
    let max_depth = req.max_depth.unwrap_or(3);
    let concurrency = req.concurrency.unwrap_or(5).min(20);

    let config = CrawlConfig {
        fetch: FetchConfig::default(),
        max_depth,
        max_pages,
        concurrency,
        delay: Duration::from_millis(200),
        path_prefix: None,
        use_sitemap: req.use_sitemap,
        include_patterns: req.include_patterns,
        exclude_patterns: req.exclude_patterns,
        allow_subdomains: req.allow_subdomains,
        allow_external_links: req.allow_external_links,
        progress_tx: None,
        cancel_flag: None,
    };

    let crawler = Crawler::new(&req.url, config).map_err(ApiError::from)?;
    let result = crawler.crawl(&req.url, None).await;

    let pages: Vec<Value> = result
        .pages
        .iter()
        .map(|p| {
            json!({
                "url": p.url,
                "depth": p.depth,
                "metadata": p.extraction.as_ref().map(|e| &e.metadata),
                "markdown": p.extraction.as_ref().map(|e| e.content.markdown.as_str()).unwrap_or(""),
                "error": p.error,
            })
        })
        .collect();

    Ok(Json(json!({
        "url": req.url,
        "status": "completed",
        "total": result.total,
        "completed": result.ok,
        "errors": result.errors,
        "elapsed_secs": result.elapsed_secs,
        "pages": pages,
    })))
}
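The request-shaping in this handler (max_pages: default 50, cap 500; concurrency: default 5, cap 20; max_depth: default 3, uncapped) reduces to one default-then-cap rule. A minimal shell sketch of that rule, with illustrative names that are not part of the repo:

```shell
# clamp_crawl DEFAULT CAP [VALUE] - apply the handler's rule: fall back
# to DEFAULT when the field was omitted, then clamp at CAP.
clamp_crawl() {
  default="$1"; cap="$2"; value="${3:-$default}"
  if [ "$value" -gt "$cap" ]; then value="$cap"; fi
  echo "$value"
}

clamp_crawl 50 500        # max_pages omitted   -> 50
clamp_crawl 50 500 2000   # max_pages too large -> 500
clamp_crawl 5 20 8        # concurrency in range -> 8
```

Note the cap is applied after the default, so a malicious or naive `max_pages: 100000` still costs at most `HARD_MAX_PAGES` fetches.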
crates/webclaw-server/src/routes/diff.rs (new file, 92 lines)
@@ -0,0 +1,92 @@
//! POST /v1/diff — compare current page content against a prior snapshot.
//!
//! Caller passes either a full prior `ExtractionResult` or the minimal
//! `{ markdown, metadata }` shape used by the hosted API. We re-fetch
//! the URL, extract, and run `webclaw_core::diff::diff` over the pair.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_core::{Content, ExtractionResult, Metadata, diff::diff};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize)]
pub struct DiffRequest {
    pub url: String,
    pub previous: PreviousSnapshot,
}

/// Either a full prior extraction, or the minimal `{ markdown, metadata }`
/// shape returned by /v1/scrape. Untagged so callers can send whichever
/// they have on hand.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
pub enum PreviousSnapshot {
    Full(ExtractionResult),
    Minimal {
        #[serde(default)]
        markdown: String,
        #[serde(default)]
        metadata: Option<Metadata>,
    },
}

impl PreviousSnapshot {
    fn into_extraction(self) -> ExtractionResult {
        match self {
            Self::Full(r) => r,
            Self::Minimal { markdown, metadata } => ExtractionResult {
                metadata: metadata.unwrap_or_else(empty_metadata),
                content: Content {
                    markdown,
                    plain_text: String::new(),
                    links: Vec::new(),
                    images: Vec::new(),
                    code_blocks: Vec::new(),
                    raw_html: None,
                },
                domain_data: None,
                structured_data: Vec::new(),
            },
        }
    }
}

fn empty_metadata() -> Metadata {
    Metadata {
        title: None,
        description: None,
        author: None,
        published_date: None,
        language: None,
        url: None,
        site_name: None,
        image: None,
        favicon: None,
        word_count: 0,
    }
}

pub async fn diff_route(
    State(state): State<AppState>,
    Json(req): Json<DiffRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }

    let current = state.fetch().fetch_and_extract(&req.url).await?;
    let previous = req.previous.into_extraction();
    let result = diff(&previous, &current);

    Ok(Json(json!({
        "url": req.url,
        "status": result.status,
        "diff": result.text_diff,
        "metadata_changes": result.metadata_changes,
        "links_added": result.links_added,
        "links_removed": result.links_removed,
        "word_count_delta": result.word_count_delta,
    })))
}
crates/webclaw-server/src/routes/extract.rs (new file, 81 lines)
@@ -0,0 +1,81 @@
//! POST /v1/extract — LLM-powered structured extraction.
//!
//! Two modes:
//! * `schema` — JSON Schema describing what to extract.
//! * `prompt` — natural-language instructions.
//!
//! At least one must be provided. The provider chain is built per
//! request from env (Ollama -> OpenAI -> Anthropic). Self-hosters
//! get the same fallback behaviour as the CLI.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_llm::{ProviderChain, extract::extract_json, extract::extract_with_prompt};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct ExtractRequest {
    pub url: String,
    pub schema: Option<Value>,
    pub prompt: Option<String>,
    /// Optional override of the provider model name (e.g. `gpt-4o-mini`).
    pub model: Option<String>,
}

pub async fn extract(
    State(state): State<AppState>,
    Json(req): Json<ExtractRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }
    let has_schema = req.schema.is_some();
    let has_prompt = req
        .prompt
        .as_deref()
        .map(|p| !p.trim().is_empty())
        .unwrap_or(false);
    if !has_schema && !has_prompt {
        return Err(ApiError::bad_request(
            "either `schema` or `prompt` is required",
        ));
    }

    // Fetch + extract first so we feed the LLM clean markdown instead of
    // raw HTML. Cheaper tokens, better signal.
    let extraction = state.fetch().fetch_and_extract(&req.url).await?;
    let content = if extraction.content.markdown.trim().is_empty() {
        extraction.content.plain_text.clone()
    } else {
        extraction.content.markdown.clone()
    };
    if content.trim().is_empty() {
        return Err(ApiError::Extract(
            "no extractable content on page".to_string(),
        ));
    }

    let chain = ProviderChain::default().await;
    if chain.is_empty() {
        return Err(ApiError::Llm(
            "no LLM providers configured (set OLLAMA_HOST, OPENAI_API_KEY, or ANTHROPIC_API_KEY)"
                .to_string(),
        ));
    }

    let model = req.model.as_deref();
    let data = if let Some(schema) = req.schema.as_ref() {
        extract_json(&content, schema, &chain, model).await?
    } else {
        let prompt = req.prompt.as_deref().unwrap_or_default();
        extract_with_prompt(&content, prompt, &chain, model).await?
    };

    Ok(Json(json!({
        "url": req.url,
        "data": data,
    })))
}
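The schema/prompt guard is worth stating on its own: a bare schema is enough, but a prompt only counts when it is non-blank after trimming. A shell restatement of that guard (illustrative helper, not part of the repo):

```shell
# validate_extract SCHEMA PROMPT - accept when a schema is present, or
# when the prompt contains anything besides whitespace; otherwise emit
# the same error message the handler returns.
validate_extract() {
  schema="$1"
  trimmed=$(printf '%s' "$2" | tr -d '[:space:]')
  if [ -n "$schema" ] || [ -n "$trimmed" ]; then
    echo ok
  else
    echo 'either `schema` or `prompt` is required'
  fi
}

validate_extract '{"type":"object"}' ''   # -> ok (schema alone suffices)
validate_extract '' '   '                 # -> the 400 error: blank prompt does not count
```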
crates/webclaw-server/src/routes/health.rs (new file, 10 lines)
@@ -0,0 +1,10 @@
use axum::Json;
use serde_json::{Value, json};

pub async fn health() -> Json<Value> {
    Json(json!({
        "status": "ok",
        "version": env!("CARGO_PKG_VERSION"),
        "service": "webclaw-server",
    }))
}
crates/webclaw-server/src/routes/map.rs (new file, 49 lines)
@@ -0,0 +1,49 @@
//! POST /v1/map — discover URLs from a site's sitemaps.
//!
//! Walks robots.txt + common sitemap paths, recursively resolves
//! `<sitemapindex>` files, and returns the deduplicated list of URLs.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_fetch::sitemap;

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize)]
pub struct MapRequest {
    pub url: String,
    /// When true, return the full SitemapEntry objects (with lastmod,
    /// priority, changefreq). Defaults to false → bare URL strings,
    /// matching the hosted-API shape.
    #[serde(default)]
    pub include_metadata: bool,
}

pub async fn map(
    State(state): State<AppState>,
    Json(req): Json<MapRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }

    let entries = sitemap::discover(state.fetch(), &req.url).await?;

    let body = if req.include_metadata {
        json!({
            "url": req.url,
            "count": entries.len(),
            "urls": entries,
        })
    } else {
        let urls: Vec<&str> = entries.iter().map(|e| e.url.as_str()).collect();
        json!({
            "url": req.url,
            "count": urls.len(),
            "urls": urls,
        })
    };

    Ok(Json(body))
}
crates/webclaw-server/src/routes/mod.rs (new file, 19 lines)
@@ -0,0 +1,19 @@
//! HTTP route handlers.
//!
//! The OSS server exposes a deliberately small surface that mirrors the
//! hosted-API JSON shapes where the underlying capability exists in the
//! OSS crates. Endpoints that depend on private infrastructure
//! (anti-bot bypass with stealth Chrome, JS rendering at scale,
//! per-user auth, billing, async job queues, agent loops) are
//! intentionally not implemented here. Use api.webclaw.io for those.

pub mod batch;
pub mod brand;
pub mod crawl;
pub mod diff;
pub mod extract;
pub mod health;
pub mod map;
pub mod scrape;
pub mod structured;
pub mod summarize;
crates/webclaw-server/src/routes/scrape.rs
Normal file
108
crates/webclaw-server/src/routes/scrape.rs
Normal file
|
|
@ -0,0 +1,108 @@
|
|||
//! POST /v1/scrape — fetch a URL, run extraction, return the requested
//! formats. JSON shape mirrors the hosted-API response where possible so
//! migrating from self-hosted → cloud is a config change, not a code one.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_core::{ExtractionOptions, llm::to_llm_text};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct ScrapeRequest {
    pub url: String,
    /// Output formats. Allowed: "markdown", "text", "llm", "json", "html".
    /// Defaults to ["markdown"]. Accepts a single string ("format")
    /// or an array ("formats") for hosted-API compatibility.
    #[serde(alias = "format")]
    pub formats: ScrapeFormats,
    pub include_selectors: Vec<String>,
    pub exclude_selectors: Vec<String>,
    pub only_main_content: bool,
}

#[derive(Debug, Deserialize)]
#[serde(untagged)]
pub enum ScrapeFormats {
    One(String),
    Many(Vec<String>),
}

impl Default for ScrapeFormats {
    fn default() -> Self {
        Self::Many(vec!["markdown".into()])
    }
}

impl ScrapeFormats {
    fn as_vec(&self) -> Vec<String> {
        match self {
            Self::One(s) => vec![s.clone()],
            Self::Many(v) => v.clone(),
        }
    }
}

pub async fn scrape(
    State(state): State<AppState>,
    Json(req): Json<ScrapeRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }
    let formats = req.formats.as_vec();

    let options = ExtractionOptions {
        include_selectors: req.include_selectors,
        exclude_selectors: req.exclude_selectors,
        only_main_content: req.only_main_content,
        include_raw_html: formats.iter().any(|f| f == "html"),
    };

    let extraction = state
        .fetch()
        .fetch_and_extract_with_options(&req.url, &options)
        .await?;

    let mut body = json!({
        "url": extraction.metadata.url.clone().unwrap_or_else(|| req.url.clone()),
        "metadata": extraction.metadata,
    });
    let obj = body.as_object_mut().expect("json::object");

    for f in &formats {
        match f.as_str() {
            "markdown" => {
                obj.insert("markdown".into(), json!(extraction.content.markdown));
            }
            "text" => {
                obj.insert("text".into(), json!(extraction.content.plain_text));
            }
            "llm" => {
                let llm = to_llm_text(&extraction, extraction.metadata.url.as_deref());
                obj.insert("llm".into(), json!(llm));
            }
            "html" => {
                if let Some(raw) = &extraction.content.raw_html {
                    obj.insert("html".into(), json!(raw));
                }
            }
            "json" => {
                obj.insert("json".into(), json!(extraction));
            }
            other => {
                return Err(ApiError::bad_request(format!(
                    "unknown format: '{other}' (allowed: markdown, text, llm, html, json)"
                )));
            }
        }
    }

    if !extraction.structured_data.is_empty() {
        obj.insert("structured_data".into(), json!(extraction.structured_data));
    }

    Ok(Json(body))
}
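The `ScrapeFormats` untagged enum means `"format": "markdown"` and `"formats": ["markdown", "html"]` both deserialize, and an omitted field defaults to `["markdown"]`. The default half of that behaviour is simple enough to state as a shell sketch (illustrative function name, not part of the repo):

```shell
# formats_or_default [FORMAT...] - mirror the ScrapeFormats default:
# no formats requested -> ["markdown"]; otherwise pass them through.
# (On the Rust side a bare string is also accepted via the untagged
# enum; in shell a single argument is already a one-element list.)
formats_or_default() {
  if [ "$#" -eq 0 ]; then set -- markdown; fi
  printf '%s\n' "$@"
}

formats_or_default           # -> markdown
formats_or_default text html
```

One consequence worth noting from the handler: `include_raw_html` is only switched on when `"html"` is among the requested formats, so the extractor skips retaining raw HTML for the common markdown-only call.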
crates/webclaw-server/src/routes/structured.rs (new file, 55 lines)
@@ -0,0 +1,55 @@
//! `POST /v1/scrape/{vertical}` and `GET /v1/extractors`.
//!
//! Vertical extractors return typed JSON instead of generic markdown.
//! See `webclaw_fetch::extractors` for the catalog and per-site logic.

use axum::{
    Json,
    extract::{Path, State},
};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_fetch::extractors::{self, ExtractorDispatchError};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize)]
pub struct ScrapeRequest {
    pub url: String,
}

/// Map dispatcher errors to ApiError so users get clean HTTP statuses
/// instead of opaque 500s.
impl From<ExtractorDispatchError> for ApiError {
    fn from(e: ExtractorDispatchError) -> Self {
        match e {
            ExtractorDispatchError::UnknownVertical(_) => ApiError::NotFound,
            ExtractorDispatchError::UrlMismatch { .. } => ApiError::bad_request(e.to_string()),
            ExtractorDispatchError::Fetch(f) => ApiError::Fetch(f.to_string()),
        }
    }
}

/// `GET /v1/extractors` — catalog of all available verticals.
pub async fn list_extractors() -> Json<Value> {
    Json(json!({
        "extractors": extractors::list(),
    }))
}

/// `POST /v1/scrape/{vertical}` — explicit vertical, e.g. /v1/scrape/reddit.
pub async fn scrape_vertical(
    State(state): State<AppState>,
    Path(vertical): Path<String>,
    Json(req): Json<ScrapeRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }
    let data = extractors::dispatch_by_name(state.fetch(), &vertical, &req.url).await?;
    Ok(Json(json!({
        "vertical": vertical,
        "url": req.url,
        "data": data,
    })))
}
crates/webclaw-server/src/routes/summarize.rs (new file, 52 lines)
@@ -0,0 +1,52 @@
//! POST /v1/summarize — LLM-powered page summary.

use axum::{Json, extract::State};
use serde::Deserialize;
use serde_json::{Value, json};
use webclaw_llm::{ProviderChain, summarize::summarize};

use crate::{error::ApiError, state::AppState};

#[derive(Debug, Deserialize, Default)]
#[serde(default)]
pub struct SummarizeRequest {
    pub url: String,
    pub max_sentences: Option<usize>,
    pub model: Option<String>,
}

pub async fn summarize_route(
    State(state): State<AppState>,
    Json(req): Json<SummarizeRequest>,
) -> Result<Json<Value>, ApiError> {
    if req.url.trim().is_empty() {
        return Err(ApiError::bad_request("`url` is required"));
    }

    let extraction = state.fetch().fetch_and_extract(&req.url).await?;
    let content = if extraction.content.markdown.trim().is_empty() {
        extraction.content.plain_text.clone()
    } else {
        extraction.content.markdown.clone()
    };
    if content.trim().is_empty() {
        return Err(ApiError::Extract(
            "no extractable content on page".to_string(),
        ));
    }

    let chain = ProviderChain::default().await;
    if chain.is_empty() {
        return Err(ApiError::Llm(
            "no LLM providers configured (set OLLAMA_HOST, OPENAI_API_KEY, or ANTHROPIC_API_KEY)"
                .to_string(),
        ));
    }

    let summary = summarize(&content, req.max_sentences, &chain, req.model.as_deref()).await?;

    Ok(Json(json!({
        "url": req.url,
        "summary": summary,
    })))
}
crates/webclaw-server/src/state.rs (new file, 107 lines)
@@ -0,0 +1,107 @@
//! Shared application state. Cheap to clone via Arc; held by the axum
//! Router for the life of the process.
//!
//! Two unrelated keys get carried here:
//!
//! 1. [`AppState::api_key`] — the **bearer token clients must present**
//!    to call this server. Set via `WEBCLAW_API_KEY` / `--api-key`.
//!    Unset = open mode.
//! 2. The inner [`webclaw_fetch::cloud::CloudClient`] (if any) — our
//!    **outbound** credential for api.webclaw.io, used by extractors
//!    that escalate on antibot. Set via `WEBCLAW_CLOUD_API_KEY`.
//!    Unset = hard-site extractors return a "set WEBCLAW_CLOUD_API_KEY"
//!    error with a signup link.
//!
//! Different variables on purpose: conflating the two means operators
//! who want their server behind an auth token can't also enable cloud
//! fallback, and vice versa.

use std::sync::Arc;
use tracing::info;
use webclaw_fetch::cloud::CloudClient;
use webclaw_fetch::{BrowserProfile, FetchClient, FetchConfig};

/// Single-process state shared across all request handlers.
#[derive(Clone)]
pub struct AppState {
    inner: Arc<Inner>,
}

struct Inner {
    /// Wrapped in `Arc` because `fetch_and_extract_batch_with_options`
    /// (used by the /v1/batch handler) takes `self: &Arc<Self>` so it
    /// can clone the client into spawned tasks. The single-call handlers
    /// auto-deref `&Arc<FetchClient>` -> `&FetchClient`, so this costs
    /// them nothing.
    pub fetch: Arc<FetchClient>,
    /// Inbound bearer-auth token for this server's own `/v1/*` surface.
    pub api_key: Option<String>,
}

impl AppState {
    /// Build the application state. The fetch client is constructed once
    /// and shared across requests so connection pools + browser profile
    /// state don't churn per request.
    ///
    /// `inbound_api_key` is the bearer token clients must present;
    /// cloud-fallback credentials come from the env (checked here).
    pub fn new(inbound_api_key: Option<String>) -> anyhow::Result<Self> {
        let config = FetchConfig {
            browser: BrowserProfile::Firefox,
            ..FetchConfig::default()
        };
        let mut fetch = FetchClient::new(config)
            .map_err(|e| anyhow::anyhow!("failed to build fetch client: {e}"))?;

        // Cloud fallback: only activates when the operator has provided
        // an api.webclaw.io key. Supports both WEBCLAW_CLOUD_API_KEY
        // (preferred, disambiguates from the inbound-auth key) and
        // WEBCLAW_API_KEY as a fallback when there's no inbound key
        // configured (backwards compat with MCP / CLI conventions).
        if let Some(cloud) = build_cloud_client(inbound_api_key.as_deref()) {
            info!(
                base = cloud.base_url(),
                "cloud fallback enabled — antibot-protected sites will escalate via api.webclaw.io"
            );
            fetch = fetch.with_cloud(cloud);
        }

        Ok(Self {
            inner: Arc::new(Inner {
                fetch: Arc::new(fetch),
                api_key: inbound_api_key,
            }),
        })
    }

    pub fn fetch(&self) -> &Arc<FetchClient> {
        &self.inner.fetch
    }

    pub fn api_key(&self) -> Option<&str> {
        self.inner.api_key.as_deref()
    }
}

/// Resolve the outbound cloud key. Prefers `WEBCLAW_CLOUD_API_KEY`;
/// falls back to `WEBCLAW_API_KEY` *only* when no inbound key is
/// configured (i.e. open mode — the same env var can't mean two
/// things to one process).
fn build_cloud_client(inbound_api_key: Option<&str>) -> Option<CloudClient> {
    let cloud_key = std::env::var("WEBCLAW_CLOUD_API_KEY").ok();
    if let Some(k) = cloud_key.as_deref()
        && !k.trim().is_empty()
    {
        return Some(CloudClient::with_key(k));
    }
    // Reuse WEBCLAW_API_KEY only when not also acting as our own
    // inbound-auth token — otherwise we'd be telling the operator
    // they can't have both.
    if inbound_api_key.is_none()
        && let Ok(k) = std::env::var("WEBCLAW_API_KEY")
        && !k.trim().is_empty()
    {
        return Some(CloudClient::with_key(k));
    }
    None
}
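The `build_cloud_client` precedence is the subtle part of this file, so here is the same three-step rule restated as a shell sketch (the function and variable `inbound` are illustrative; the env var names are the real ones):

```shell
# resolve_cloud_key INBOUND_KEY - precedence matching build_cloud_client:
#   1. a non-blank WEBCLAW_CLOUD_API_KEY always wins;
#   2. WEBCLAW_API_KEY is reused only when no inbound key is configured;
#   3. otherwise: no cloud fallback (empty output).
resolve_cloud_key() {
  inbound="$1"
  cloud=$(printf '%s' "${WEBCLAW_CLOUD_API_KEY:-}" | tr -d '[:space:]')
  if [ -n "$cloud" ]; then
    echo "${WEBCLAW_CLOUD_API_KEY}"
    return
  fi
  legacy=$(printf '%s' "${WEBCLAW_API_KEY:-}" | tr -d '[:space:]')
  if [ -z "$inbound" ] && [ -n "$legacy" ]; then
    echo "${WEBCLAW_API_KEY}"
    return
  fi
  echo ""
}

WEBCLAW_CLOUD_API_KEY=cloud-key WEBCLAW_API_KEY=auth-key
resolve_cloud_key inbound-token   # -> cloud-key
WEBCLAW_CLOUD_API_KEY=""
resolve_cloud_key inbound-token   # -> nothing: WEBCLAW_API_KEY stays inbound-only
resolve_cloud_key ""              # -> auth-key (open mode, so reuse is safe)
```

Step 2 is what keeps one env var from meaning two things to one process: once `WEBCLAW_API_KEY` is guarding the server's own `/v1/*` surface, it is never silently forwarded to api.webclaw.io.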
docker-entrypoint.sh (new executable file, 33 lines)
@@ -0,0 +1,33 @@
#!/bin/sh
# webclaw docker entrypoint.
#
# Behaves like the real binary when the first arg looks like a webclaw arg
# (URL or flag), so `docker run ghcr.io/0xmassi/webclaw https://example.com`
# still works. But gets out of the way when the first arg looks like a
# different command (e.g. `./setup.sh`, `bash`, `sh -c ...`), so this image
# can be used as a FROM base in downstream Dockerfiles with a custom CMD.
#
# Test matrix:
#   docker run IMAGE https://example.com  → webclaw https://example.com
#   docker run IMAGE --help               → webclaw --help
#   docker run IMAGE --file page.html     → webclaw --file page.html
#   docker run IMAGE --stdin < page.html  → webclaw --stdin
#   docker run IMAGE bash                 → bash
#   docker run IMAGE ./setup.sh           → ./setup.sh
#   docker run IMAGE                      → webclaw --help (default CMD)
#
# Root cause fixed: v0.3.13 switched CMD→ENTRYPOINT to make the first use
# case work, which trapped the last four. This shim restores all of them.

set -e

# If the first arg starts with `-`, `http://`, or `https://`, treat the
# whole arg list as webclaw flags/URL.
if [ "$#" -gt 0 ] && {
    [ "${1#-}" != "$1" ] || \
    [ "${1#http://}" != "$1" ] || \
    [ "${1#https://}" != "$1" ]; }; then
    set -- webclaw "$@"
fi

exec "$@"
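The dispatch test above uses `${1#prefix}` stripping: if removing the prefix changes the word, the word had that prefix. The same decision can be written with `case`, which may be easier to audit; this sketch (function name illustrative) reproduces the test matrix:

```shell
# classify ARG - the entrypoint's dispatch rule, rewritten with `case`
# instead of `${1#prefix}` stripping; both forms are plain POSIX sh.
classify() {
  case "$1" in
    -*|http://*|https://*) echo webclaw-args ;;
    *)                     echo passthrough ;;
  esac
}

classify --help                 # -> webclaw-args
classify https://example.com    # -> webclaw-args
classify bash                   # -> passthrough
classify ./setup.sh             # -> passthrough
```

Either form keeps the image usable both as a CLI (`docker run IMAGE URL`) and as a base image with a custom CMD.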