2026-03-29 16:40:10 +02:00
//! webclaw-fetch: HTTP client layer with browser TLS fingerprint impersonation.
2026-04-01 18:04:55 +02:00
//! Uses wreq (BoringSSL) for browser-grade TLS + HTTP/2 fingerprinting.
2026-03-29 16:40:10 +02:00
//! Automatically detects PDF responses and delegates to webclaw-pdf.
2026-03-23 18:31:11 +01:00
pub mod browser;
pub mod client;
refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch

The local-first / cloud-fallback flow was duplicated in two places:
- webclaw-mcp/src/cloud.rs (302 lines, canonical)
- webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid
  pulling rmcp as a dep)

Move to the shared crate where all vertical extractors and the new
webclaw-server can also reach it.

## New module: webclaw-fetch/src/cloud.rs

Single canonical home. Consolidates both previous versions and
promotes the error type from stringy to typed:
- `CloudError` enum with dedicated variants for the four HTTP
  outcomes callers act on differently: 401 (key rejected),
  402 (insufficient plan), 429 (rate limited), plus ServerError /
  Network / ParseFailed. Each variant's Display message ends with
  an actionable URL (signup / pricing / dashboard) so API consumers
  can surface it verbatim.
- `From<CloudError> for String` bridge so the dozen existing
  `.await?` call sites in MCP / CLI that expected `Result<_, String>`
  keep compiling. We can migrate them to the typed error per-site
  later without a churn commit.
- `CloudClient::new(Option<&str>)` matches the CLI's `--api-key`
  flag pattern (explicit key wins, env fallback, None when empty).
  `::from_env()` kept for MCP-style call sites.
- `with_key_and_base` for staging / integration tests.
- `scrape / post / get / fetch_html`: `fetch_html` is new, a
  convenience that calls /v1/scrape with formats=["html"] and
  returns the raw HTML string so vertical extractors can plug
  antibot-bypassed HTML straight into their parsers.
- `is_bot_protected` + `needs_js_rendering` detectors moved
  over verbatim. Detection patterns are public (CF / DataDome /
  AWS WAF challenge-page signatures), so no moat leak.
- `smart_fetch` kept on the original `Result<_, String>`
  signature so MCP's six call sites compile unchanged.
- `smart_fetch_html` is new: the local-first-then-cloud flow
  for the vertical-extractor pattern, returning the typed
  `CloudError` so extractors can emit precise upgrade-path
  messages.
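
The typed-error-plus-String-bridge idea described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from `cloud.rs`: the variant names `Unauthorized`, `InsufficientPlan` and `RateLimited`, the `from_status` helper, the placeholder URLs and the `typed_call` / `legacy_call` functions are all assumptions made for the example. Only `NotConfigured`, `ServerError`, `Network` and `ParseFailed` are named in the commit message, and only the first four Display messages here end with a URL.

```rust
use std::fmt;

// Placeholder URLs (assumptions); the real messages point at the
// project's own signup / pricing / dashboard pages.
const SIGNUP_URL: &str = "https://example.com/signup";
const PRICING_URL: &str = "https://example.com/pricing";
const DASHBOARD_URL: &str = "https://example.com/dashboard";

/// Hypothetical shape of the typed error.
#[derive(Debug, PartialEq)]
pub enum CloudError {
    NotConfigured,
    Unauthorized,     // HTTP 401: key rejected
    InsufficientPlan, // HTTP 402
    RateLimited,      // HTTP 429
    ServerError { status: u16, body: String },
    Network(String),
    ParseFailed(String),
}

impl CloudError {
    /// Map a non-success HTTP status to a variant (hypothetical helper).
    pub fn from_status(status: u16, body: &str) -> Self {
        match status {
            401 => Self::Unauthorized,
            402 => Self::InsufficientPlan,
            429 => Self::RateLimited,
            _ => Self::ServerError { status, body: body.to_string() },
        }
    }
}

impl fmt::Display for CloudError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::NotConfigured => write!(f, "no API key configured; sign up at {}", SIGNUP_URL),
            Self::Unauthorized => write!(f, "API key rejected (401); check it at {}", DASHBOARD_URL),
            Self::InsufficientPlan => write!(f, "insufficient plan (402); see {}", PRICING_URL),
            Self::RateLimited => write!(f, "rate limited (429); see {}", DASHBOARD_URL),
            Self::ServerError { status, body } => write!(f, "server error {}: {}", status, body),
            Self::Network(msg) => write!(f, "network error: {}", msg),
            Self::ParseFailed(msg) => write!(f, "could not parse response: {}", msg),
        }
    }
}

/// The bridge: lets `?` convert a `CloudError` inside functions that
/// still return `Result<_, String>`.
impl From<CloudError> for String {
    fn from(err: CloudError) -> String {
        err.to_string()
    }
}

/// Stand-in for a typed cloud call (no network involved).
pub fn typed_call(status: u16) -> Result<String, CloudError> {
    if status == 200 {
        Ok("ok".to_string())
    } else {
        Err(CloudError::from_status(status, ""))
    }
}

/// A legacy-style call site: the signature is unchanged, and `?`
/// goes through the `From` impl above.
pub fn legacy_call(status: u16) -> Result<String, String> {
    let body = typed_call(status)?;
    Ok(body)
}
```

The bridge compiles under Rust's orphan rules because `CloudError` is a local type, even though both `From` and `String` come from the standard library.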

## Cleanup

- Deleted webclaw-mcp/src/cloud.rs. All imports now resolve to
  `webclaw_fetch::cloud::*`. Dropped reqwest as a direct dep of
  webclaw-mcp (it only used it for the old cloud client).
- Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its
  webhook / on-change / research HTTP calls.
- webclaw-fetch now has reqwest as a direct dep. It was already
  transitively pulled in by webclaw-llm; this just makes the
  dependency relationship explicit at the call site.

## Tests

16 new unit tests cover:
- CloudError status mapping (401/402/429/5xx)
- NotConfigured error includes signup URL
- CloudClient::new explicit-key-wins-over-env + empty-string = None
- base_url strips trailing slash
- Detector matrix (CF challenge / Turnstile / real content with
  embedded Turnstile / SPA skeleton / real article with script tags)
- truncate respects char boundaries (don't slice inside UTF-8)

Full workspace test suite still passes (~500 tests). fmt + clippy
clean. No behavior change for existing MCP / CLI call sites.
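
Three of the behaviours named in the test list (key precedence, trailing-slash stripping, char-boundary-safe truncation) are small enough to sketch directly. The function names `resolve_key` and `normalize_base_url` and all signatures below are assumptions for illustration, not the crate's actual API. In particular, this sketch treats an empty explicit key as absent and falls through to the environment value, which is only one possible reading of "empty-string = None".

```rust
/// Treat blank keys as absent (hypothetical helper).
fn non_empty(key: &str) -> Option<String> {
    let key = key.trim();
    if key.is_empty() {
        None
    } else {
        Some(key.to_owned())
    }
}

/// Explicit key wins, then the environment value, else None.
/// The env value is passed in rather than read from `std::env`
/// so the function stays deterministic under test.
pub fn resolve_key(explicit: Option<&str>, env_value: Option<&str>) -> Option<String> {
    explicit
        .and_then(non_empty)
        .or_else(|| env_value.and_then(non_empty))
}

/// Strip trailing slashes so later `format!("{base}/v1/...")` calls
/// do not produce a double slash.
pub fn normalize_base_url(base: &str) -> &str {
    base.trim_end_matches('/')
}

/// Truncate to at most `max_bytes` bytes without slicing inside a
/// multi-byte UTF-8 sequence: back off to the nearest char boundary.
pub fn truncate(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Slicing a `&str` at a non-boundary index panics at runtime, which is why the truncation helper walks back with `is_char_boundary` rather than slicing at `max_bytes` directly.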
2026-04-22 16:05:44 +02:00
pub mod cloud;
2026-03-23 18:31:11 +01:00
pub mod crawler;
2026-03-26 15:28:23 +01:00
pub mod document;
2026-03-23 18:31:11 +01:00
pub mod error;
feat(extractors): add vertical extractors module + first 6 verticals

New extractors module returns site-specific typed JSON instead of
generic markdown. Each extractor:
- declares a URL pattern via matches()
- fetches from the site's official JSON API where one exists
- returns a typed serde_json::Value with documented field names
- exposes an INFO struct that powers the /v1/extractors catalog

First 6 verticals shipped, all hitting public JSON APIs (no HTML
scraping, zero antibot risk):
- reddit → www.reddit.com/*/.json
- hackernews → hn.algolia.com/api/v1/items/{id} (full thread in one call)
- github_repo → api.github.com/repos/{owner}/{repo}
- pypi → pypi.org/pypi/{name}/json
- npm → registry.npmjs.org/{name} + downloads/point/last-week
- huggingface_model → huggingface.co/api/models/{owner}/{name}

Server-side routes added:
- POST /v1/scrape/{vertical} explicit per-vertical extraction
- GET /v1/extractors catalog (name, label, description, url_patterns)

The dispatcher validates that URL matches the requested vertical
before running, so users get "URL doesn't match the X extractor"
instead of opaque parse failures inside the extractor.

17 unit tests cover URL matching + path parsing for each vertical.
Live tests against canonical URLs (rust-lang/rust, requests pypi,
react npm, whisper-large-v3 hf, item 8863 hn, an r/micro_saas post)
all return correct typed JSON in 100-300ms. Sample sizes: github
863B, npm 700B, pypi 1.7KB, hf 3.2KB, hn 38KB (full comment tree).

Marketing positioning: Firecrawl charges 5 credits per /extract call
and you write the schema. Webclaw returns the same JSON in 1 credit
per /scrape/{vertical} call with hand-written deterministic
extractors per site.
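
The matches-then-dispatch flow described above might look roughly like the sketch below. Everything here is hypothetical except what the commit message states: the catalog field names (name, label, description, url_patterns), the `matches()` method, the pypi API path `pypi.org/pypi/{name}/json`, and the error wording. The `Extractor` trait, the `Pypi` type, the `https://pypi.org/project/{name}/` input pattern and the `dispatch` signature are assumptions, and no network call is made.

```rust
/// Catalog entry; field names follow the /v1/extractors description.
pub struct ExtractorInfo {
    pub name: &'static str,
    pub label: &'static str,
    pub description: &'static str,
    pub url_patterns: &'static [&'static str],
}

/// Hypothetical trait; the real module may use free functions instead.
pub trait Extractor {
    fn info(&self) -> &'static ExtractorInfo;
    fn matches(&self, url: &str) -> bool;
}

pub struct Pypi;

static PYPI_INFO: ExtractorInfo = ExtractorInfo {
    name: "pypi",
    label: "PyPI package",
    description: "Package metadata from the PyPI JSON API",
    url_patterns: &["https://pypi.org/project/{name}/"],
};

/// Parse the package name out of a project-page URL (assumed input shape).
pub fn pypi_package(url: &str) -> Option<&str> {
    let rest = url.strip_prefix("https://pypi.org/project/")?;
    let name = rest.split('/').next()?;
    if name.is_empty() { None } else { Some(name) }
}

/// Build the JSON API URL named in the commit message.
pub fn pypi_api_url(name: &str) -> String {
    format!("https://pypi.org/pypi/{}/json", name)
}

impl Extractor for Pypi {
    fn info(&self) -> &'static ExtractorInfo {
        &PYPI_INFO
    }
    fn matches(&self, url: &str) -> bool {
        pypi_package(url).is_some()
    }
}

/// Look up the requested vertical and check the URL before running it,
/// so a mismatch is reported up front instead of as a parse failure.
pub fn dispatch<'a>(
    extractors: &[&'a dyn Extractor],
    vertical: &str,
    url: &str,
) -> Result<&'a dyn Extractor, String> {
    let ext = extractors
        .iter()
        .copied()
        .find(|e| e.info().name == vertical)
        .ok_or_else(|| format!("unknown vertical: {}", vertical))?;
    if !ext.matches(url) {
        return Err(format!("URL doesn't match the {} extractor", vertical));
    }
    Ok(ext)
}
```

Validating in the dispatcher keeps each extractor free to assume its input URL already has the expected shape.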
2026-04-22 14:11:43 +02:00
pub mod extractors;
2026-04-22 21:17:50 +02:00
pub mod fetcher;
2026-03-29 16:54:35 +02:00
pub mod linkedin;
2026-04-23 12:58:24 +02:00
pub mod locale;
2026-03-23 18:31:11 +01:00
pub mod proxy;
2026-03-29 16:54:35 +02:00
pub mod reddit;
2026-03-23 18:31:11 +01:00
pub mod sitemap;
2026-04-01 18:04:55 +02:00
pub mod tls;
2026-05-04 11:50:57 +02:00
pub mod url_security;
2026-03-23 18:31:11 +01:00

pub use browser::BrowserProfile;
pub use client::{BatchExtractResult, BatchResult, FetchClient, FetchConfig, FetchResult};
2026-03-25 21:38:28 +01:00
pub use crawler::{CrawlConfig, CrawlResult, CrawlState, Crawler, PageResult};
2026-03-23 18:31:11 +01:00
pub use error::FetchError;
2026-04-22 21:17:50 +02:00
pub use fetcher::Fetcher;
2026-04-01 18:25:40 +02:00
pub use http::HeaderMap;
2026-04-23 12:58:24 +02:00
pub use locale::{accept_language_for_tld, accept_language_for_url};
2026-03-23 18:31:11 +01:00
pub use proxy::{parse_proxy_file, parse_proxy_line};
pub use sitemap::SitemapEntry;
pub use webclaw_pdf::PdfMode;