webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-07 22:15:12 +02:00

Author	SHA1	Message	Date
Valerio	0ab891bd6b	refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch The local-first / cloud-fallback flow was duplicated in two places: - webclaw-mcp/src/cloud.rs (302 lines, canonical) - webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid pulling rmcp as a dep) Move to the shared crate where all vertical extractors and the new webclaw-server can also reach it. ## New module: webclaw-fetch/src/cloud.rs Single canonical home. Consolidates both previous versions and promotes the error type from stringy to typed: - `CloudError` enum with dedicated variants for the four HTTP outcomes callers act on differently — 401 (key rejected), 402 (insufficient plan), 429 (rate limited), plus ServerError / Network / ParseFailed. Each variant's Display message ends with an actionable URL (signup / pricing / dashboard) so API consumers can surface it verbatim. - `From<CloudError> for String` bridge so the dozen existing `.await?` call sites in MCP / CLI that expected `Result<_, String>` keep compiling. We can migrate them to the typed error per-site later without a churn commit. - `CloudClient::new(Option<&str>)` matches the CLI's `--api-key` flag pattern (explicit key wins, env fallback, None when empty). `::from_env()` kept for MCP-style call sites. - `with_key_and_base` for staging / integration tests. - `scrape / post / get / fetch_html` — `fetch_html` is new, a convenience that calls /v1/scrape with formats=["html"] and returns the raw HTML string so vertical extractors can plug antibot-bypassed HTML straight into their parsers. - `is_bot_protected` + `needs_js_rendering` detectors moved over verbatim. Detection patterns are public (CF / DataDome / AWS WAF challenge-page signatures) — no moat leak. - `smart_fetch` kept on the original `Result<_, String>` signature so MCP's six call sites compile unchanged. 
- `smart_fetch_html` is new: the local-first-then-cloud flow for the vertical-extractor pattern, returning the typed `CloudError` so extractors can emit precise upgrade-path messages. ## Cleanup - Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to `webclaw_fetch::cloud::*`. Dropped reqwest as a direct dep of webclaw-mcp (it only used it for the old cloud client). - Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its webhook / on-change / research HTTP calls. - webclaw-fetch now has reqwest as a direct dep. It was already transitively pulled in by webclaw-llm; this just makes the dependency relationship explicit at the call site. ## Tests 16 new unit tests cover: - CloudError status mapping (401/402/429/5xx) - NotConfigured error includes signup URL - CloudClient::new explicit-key-wins-over-env + empty-string = None - base_url strips trailing slash - Detector matrix (CF challenge / Turnstile / real content with embedded Turnstile / SPA skeleton / real article with script tags) - truncate respects char boundaries (don't slice inside UTF-8) Full workspace test suite still passes (~500 tests). fmt + clippy clean. No behavior change for existing MCP / CLI call sites.	2026-04-22 16:05:44 +02:00
Valerio	3bb0a4bca0	feat(extractors): add LinkedIn + Instagram with profile-to-posts fan-out 3 social-network extractors that work entirely without auth, using public embed/preview endpoints + Instagram's own SEO-facing API: - linkedin_post: /embed/feed/update/{urn} returns full body, author, image, OG tags. Accepts both the urn:li:share and urn:li:activity URN forms plus the pretty /posts/{slug}-{id}-{suffix} URLs. - instagram_post: /p/{shortcode}/embed/captioned/ returns the full caption, username, thumbnail. Same endpoint serves reels and IGTV, kind correctly classified. - instagram_profile: /api/v1/users/web_profile_info/?username=X with the x-ig-app-id header (Instagram's public web-app id, sent by their own JS bundle). Returns the full profile + the 12 most recent posts with shortcodes, kinds, like/comment counts, thumbnails, and caption previews. Falls back to OG-tag scraping of the public HTML if the API ever 401/403s. The IG profile output is shaped so callers can fan out cleanly: for p in profile.recent_posts: scrape('instagram_post', p.url) giving you 'whole profile + every recent post' in one loop. End-to-end tested against ticketswave: 1 profile call + 12 post calls in ~3.5s. Pagination beyond 12 posts requires authenticated cookies and is left for the cloud where we can stash a session. Infrastructure change: added FetchClient::fetch_with_headers so extractors can satisfy site-specific request headers (here x-ig-app-id; later github_pr will use this for Authorization, etc.) without polluting the global FetchConfig.headers map. Same retry semantics as fetch(). Catalog now exposes 17 extractors via /v1/extractors. Total unit tests across the module: 47 passing. Clippy clean. Fmt clean. Live test on the maintainer's example URLs: - LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body / shipper.club link / CDN image extracted in 250ms. - Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave username, thumbnail. 200ms. 
- Instagram profile (ticketswave): 18,473 followers (exact, not rounded), is_verified=True, is_business=True, biography with emojis, 12 recent posts with shortcodes + kinds + likes. 400ms. Out of scope for this wave (require infra we don't have): - linkedin_profile: returns 999 to all bot UAs, needs OAuth - facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome - facebook_profile (personal): not publicly accessible by design	2026-04-22 14:39:49 +02:00
Valerio	aaf51eddef	feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles. This removes all 5 [patch.crates-io] entries that consumers previously needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145) are now built directly on wreq's Emulation API with correct TLS options, HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order. 84% pass rate across 1000 real sites. 384 unit tests green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:04:55 +02:00
Valerio	199dab6dfa	fix: adapt to webclaw-tls v0.1.1 HeaderMap API change Response.headers() now returns &http::HeaderMap instead of &HashMap<String, String>. Updated FetchResult, is_pdf_content_type, is_document_content_type, is_bot_protected, and all related tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:09:50 +02:00
Valerio	f13cb83c73	feat: replace primp with webclaw-tls, bump to v0.3.0 Replace primp dependency with our own TLS fingerprinting stack (webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match. - Remove primp entirely (zero references remaining) - webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls - Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains) - Skip unknown certificate extensions (SCT tolerance) - 99% bypass rate on 102 sites (was ~85% with primp) - Fixes #5 (HTTPS broken — example.com and similar sites now work) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:40:10 +02:00
Valerio	ea14848772	feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM Document extraction: - DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml) - XLSX/XLS: markdown tables with multi-sheet support (via calamine) - CSV: quoted field handling, markdown table output - All auto-detected by Content-Type header or URL extension New features: - -f html output format (sanitized HTML) - Multi-URL watch: --urls-file + --watch monitors all URLs in parallel - Batch + LLM: --extract-prompt/--extract-json works with multiple URLs - Mixed batch: HTML pages + DOCX + XLSX + CSV in one command Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:28:23 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00

7 commits
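
The typed-error design described in commit 0ab891bd6b (status codes mapped to dedicated `CloudError` variants, Display messages ending in an actionable URL, and a `From<CloudError> for String` bridge for older call sites) could look roughly like the sketch below. This is not the code from webclaw-fetch. The variant names for 401/402/429, the payload types, the `from_status`, `check` and `legacy_call` helpers, the message wording, and the three `example.com` URLs are assumptions made for illustration; only `NotConfigured`, `ServerError`, `Network` and `ParseFailed` are named in the log.

```rust
use std::fmt;

// Placeholder URLs (assumption): the log says messages end with a
// signup / pricing / dashboard link but does not give the links.
const SIGNUP_URL: &str = "https://example.com/signup";
const PRICING_URL: &str = "https://example.com/pricing";
const DASHBOARD_URL: &str = "https://example.com/dashboard";

/// Hypothetical shape of the typed error. Names of the 401/402/429
/// variants and all payloads are illustrative guesses.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
pub enum CloudError {
    NotConfigured,
    Unauthorized,     // HTTP 401: key rejected
    InsufficientPlan, // HTTP 402
    RateLimited,      // HTTP 429
    ServerError { status: u16, body: String },
    Network(String),
    ParseFailed(String),
}

impl CloudError {
    /// Map a non-success HTTP status to the variant callers act on.
    pub fn from_status(status: u16, body: &str) -> Self {
        match status {
            401 => CloudError::Unauthorized,
            402 => CloudError::InsufficientPlan,
            429 => CloudError::RateLimited,
            _ => CloudError::ServerError { status, body: body.to_string() },
        }
    }
}

impl fmt::Display for CloudError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Every message ends with a URL so it can be surfaced verbatim.
        match self {
            CloudError::NotConfigured => write!(f, "no API key configured, get one at {}", SIGNUP_URL),
            CloudError::Unauthorized => write!(f, "API key rejected, check it at {}", DASHBOARD_URL),
            CloudError::InsufficientPlan => write!(f, "plan does not cover this request, see {}", PRICING_URL),
            CloudError::RateLimited => write!(f, "rate limited, see usage at {}", DASHBOARD_URL),
            CloudError::ServerError { status, body } => {
                write!(f, "server error {}: {}, see status at {}", status, body, DASHBOARD_URL)
            }
            CloudError::Network(e) => write!(f, "network error: {}, see status at {}", e, DASHBOARD_URL),
            CloudError::ParseFailed(e) => write!(f, "unparseable response: {}, see status at {}", e, DASHBOARD_URL),
        }
    }
}

// Bridge so call sites returning Result<_, String> keep compiling with `?`.
impl From<CloudError> for String {
    fn from(e: CloudError) -> String {
        e.to_string()
    }
}

/// Typed check (hypothetical helper): Ok for 2xx, a CloudError otherwise.
pub fn check(status: u16) -> Result<(), CloudError> {
    if (200..300).contains(&status) {
        Ok(())
    } else {
        Err(CloudError::from_status(status, "upstream failure"))
    }
}

/// Stringly-typed caller: `?` converts CloudError into String via the bridge.
pub fn legacy_call(status: u16) -> Result<&'static str, String> {
    check(status)?;
    Ok("ok")
}
```

With a bridge like this, a caller that wants to branch on the failure can match on `check(...)` directly, while older code paths such as `legacy_call` continue to receive a plain message string.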