webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-04-25 00:06:21 +02:00

Author	SHA1	Message	Date
Valerio	2ba682adf3	feat(server): add OSS webclaw-server REST API binary (closes #29) Self-hosters hitting docs/self-hosting were promised three binaries but the OSS Docker image only shipped two. webclaw-server lived in the closed-source hosted-platform repo, which couldn't be opened. This adds a minimal axum REST API in the OSS repo so self-hosting actually works without pretending to ship the cloud platform. Crate at crates/webclaw-server/. Stateless, no database, no job queue, single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map, batch, extract, summarize, diff, brand}. JSON shapes mirror api.webclaw.io for the endpoints OSS can support, so swapping between self-hosted and hosted is a base-URL change. Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison is constant-time (subtle::ConstantTimeEq). Open mode (no key) is allowed and binds 127.0.0.1 by default; the Docker image flips WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box. Hard caps to keep naive callers from OOMing the process: crawl capped at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent. For unbounded crawls or anti-bot bypass the docs point users at the hosted API. Dockerfile + Dockerfile.ci updated to copy webclaw-server into /usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0 (new public binary).	2026-04-22 12:25:11 +02:00
Valerio	b4bfff120e	fix(docker): entrypoint shim so child images with custom CMD work (#28) v0.3.13 switched ENTRYPOINT to ["webclaw"] to make `docker run IMAGE https://example.com` work. That broke a different use case: downstream Dockerfiles that `FROM ghcr.io/0xmassi/webclaw` and set their own CMD ["./setup.sh"]. The child's ./setup.sh becomes an arg to webclaw, which tries to fetch it as a URL and fails: fetch error: request failed: error sending request for uri (https://./setup.sh): client error (Connect) Both Dockerfile and Dockerfile.ci now use docker-entrypoint.sh which: - forwards flags (-*) and URLs (http://, https://) to `webclaw` - exec's anything else directly Test matrix (all pass locally): docker run IMAGE https://example.com → webclaw scrape ok docker run IMAGE --help → webclaw --help ok docker run IMAGE → default CMD, --help docker run IMAGE bash → bash runs FROM IMAGE + CMD ["./setup.sh"] → setup.sh runs, webclaw available Default CMD is ["webclaw", "--help"] so bare `docker run IMAGE` still prints help. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 15:57:47 +02:00
Valerio	7f0420bbf0	fix(core): UTF-8 char boundary panic in find_content_position (#16) (#24) `search_from = abs_pos + 1` landed mid-char when a rejected match started on a multi-byte UTF-8 character, panicking on the next `markdown[search_from..]` slice. Advance by `needle.len()` instead, which is always a valid char boundary and skips the whole rejected match instead of re-scanning inside it. Repro: webclaw https://bruler.ru/about_brand -f json Before: panic "byte index 782 is not a char boundary; it is inside 'Ч'" After: extracts 2.3KB of clean Cyrillic markdown with 7 sections Two regression tests cover multi-byte rejected matches and all-rejected cycles in Cyrillic text. Closes #16 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 12:02:52 +02:00
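The char-boundary fix described in 7f0420bbf0 can be illustrated with a small sketch. This is not the project's `find_content_position`; the function name `find_accepted` and the `accept` predicate are hypothetical stand-ins for whatever logic rejects a match. The point it shows is only the advance step: after a rejected match, the search resumes at the end of the match, which is a char boundary because the needle is itself valid UTF-8.

```rust
/// Return the byte offset of the first occurrence of `needle` in `haystack`
/// that the caller-supplied `accept` predicate approves.
/// `accept` is a hypothetical placeholder for the real rejection logic.
fn find_accepted(
    haystack: &str,
    needle: &str,
    accept: impl Fn(usize) -> bool,
) -> Option<usize> {
    if needle.is_empty() {
        return None; // an empty needle would never advance the cursor
    }
    let mut search_from = 0;
    loop {
        // `find` returns an offset relative to the slice start.
        let rel = haystack[search_from..].find(needle)?;
        let abs_pos = search_from + rel;
        if accept(abs_pos) {
            return Some(abs_pos);
        }
        // Advancing by 1 byte could land inside a multi-byte character and
        // make the next slice panic. The end of the match is always a valid
        // char boundary, so resume from there.
        search_from = abs_pos + needle.len();
    }
}
```

With Cyrillic input such as `"Что Что Что"`, where each letter is two bytes, rejecting the first two matches returns the third at byte offset 14 without panicking.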
Valerio	095ae5d4b1	polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23) Three P3 items from the 2026-04-16 audit. Bump to 0.3.17. webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice plus eq_ignore_ascii_case for the directive test. That was fragile: "Sitemap :" (space before colon) fell through silently, inline "# ..." comments leaked into the URL, and a line with no URL at all returned an empty string. Rewritten to split on the first colon, match any-case "sitemap" as the directive name, strip comments, and require `://` in the value. +7 unit tests cover case variants, space-before-colon, comments, empty values, non-URL values, and non-sitemap directives. webclaw-fetch/crawler.rs: is_cancelled uses Ordering::Acquire instead of Relaxed. Behaviourally equivalent on current hardware for single-word atomic loads, but the explicit ordering documents intent for readers + compilers. webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox FetchClient. Tool calls that repeatedly request the firefox profile without cookies used to build a fresh reqwest pool + TLS stack per call. Chrome (default) already used the long-lived field; Random is per-call by design; cookie-bearing requests still build ad-hoc since the cookie header is part of the client shape. Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core, 43 webclaw-llm, 11 CLI, all green. Clippy clean across workspace. Refs: docs/AUDIT-2026-04-16.md P3 section Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 20:21:32 +02:00
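The robots.txt parsing rules listed in 095ae5d4b1 (split on the first colon, case-insensitive `sitemap` directive, strip comments, require `://`) can be sketched roughly as below. This is an illustration under assumptions, not the code in webclaw-fetch/sitemap.rs: the helper name `parse_sitemap_line` is hypothetical, and cutting at the first `#` is a simplification that would also truncate a URL containing a fragment.

```rust
/// Extract the sitemap URL from one robots.txt line, if it has one.
/// Hypothetical helper; the real parser may differ in details.
fn parse_sitemap_line(line: &str) -> Option<String> {
    // Drop an inline comment: everything from the first '#'.
    let without_comment = line.split('#').next().unwrap_or("");
    // Split on the FIRST colon only, so the "https://" colon stays in the value.
    let (name, value) = without_comment.split_once(':')?;
    // Trimming the name tolerates "Sitemap :" (space before the colon).
    if !name.trim().eq_ignore_ascii_case("sitemap") {
        return None;
    }
    let value = value.trim();
    // Reject empty values and non-URL values such as "/sitemap.xml".
    if value.contains("://") {
        Some(value.to_string())
    } else {
        None
    }
}

/// Collect every sitemap URL declared in a robots.txt body.
fn parse_robots_txt(body: &str) -> Vec<String> {
    body.lines().filter_map(parse_sitemap_line).collect()
}
```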
Valerio	d69c50a31d	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22) * feat(fetch,llm): DoS hardening via response caps + glob validation (P2) Response body caps: - webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks Content-Length up front (before the allocation) and the actual .bytes() length after (belt-and-braces against lying upstreams). Previously the HTML -> markdown conversion downstream could allocate multiple String copies per page; a 100 MB page would OOM the process. - webclaw-llm providers (anthropic/openai/ollama) share a new response_json_capped helper with a 5 MB cap. Protects against a malicious or runaway provider response exhausting memory. Crawler frontier cap: after each BFS depth level the frontier is truncated to max(max_pages * 10, 100) entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after max_pages halted new fetches. Glob pattern validation: user-supplied include_patterns / exclude_patterns are rejected at Crawler::new if they contain more than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `**` against long paths. Cleanup: - Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs; no warnings surfaced, the suppression was obsolete. - core/.gitignore: replaced overbroad *.json with specific local-artifact patterns (previous rule would have swallowed package.json, components.json, .smithery/*.json). Tests: +4 validate_glob tests. Full workspace test: 283 passed (webclaw-core + webclaw-fetch + webclaw-llm). Version: 0.3.15 -> 0.3.16 CHANGELOG updated. Refs: docs/AUDIT-2026-04-16.md (P2 section) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: gitignore CLI research dumps, drop accidentally-tracked file research-*.json output from `webclaw ... --research ...` got silently swept into git by the relaxed *.json gitignore in the preceding commit. The old blanket *.json rule was hiding both this legitimate scratch file AND packages/create-webclaw/server.json (MCP registry config that we DO want tracked). Removes the research dump from git and adds a narrower research-*.json ignore pattern so future CLI output doesn't get re-tracked by accident. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:44:08 +02:00
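The crawler frontier cap in d69c50a31d, truncation to max(max_pages * 10, 100) entries while keeping the most recently discovered links, might look like the sketch below. It assumes the frontier is a `Vec<String>` in discovery order with the newest links at the end; the function name `cap_frontier` and that data layout are assumptions made for illustration, not taken from the crawler source.

```rust
/// Truncate the frontier to at most max(max_pages * 10, 100) entries,
/// keeping the most recently discovered links.
/// Assumes newest links sit at the END of the vector (hypothetical layout).
fn cap_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    // saturating_mul avoids overflow for absurdly large max_pages values.
    let cap = max_pages.saturating_mul(10).max(100);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        // Drop the oldest entries from the front.
        frontier.drain(..excess);
    }
}
```

The floor of 100 keeps very small crawls (for example `max_pages = 5`) from starving the frontier, while the `* 10` factor bounds memory on link-dense pages.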
Valerio	7773c8af2a	fix(fetch): surface semaphore-closed as typed error instead of panic (P1) (#21) Three call sites in webclaw-fetch used .expect("semaphore closed") on `Semaphore::acquire()`. Under normal operation they never fire, but under a shutdown race or adversarial runtime state the spawned task would panic and be silently dropped from the batch / crawl run; the caller would see fewer results than URLs with no indication why. Rewritten to match on the acquire result: - client::fetch_batch and client::fetch_and_extract_batch_with_options now emit BatchResult/BatchExtractResult carrying FetchError::Build("semaphore closed before acquire"). - crawler's inner loop emits a failed PageResult with the same error string instead of panicking. Behaviorally a no-op for the happy path. Fixes the silent-dropped-task class of bug noted in the 2026-04-16 audit. Version: 0.3.14 -> 0.3.15 CHANGELOG updated. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:20:26 +02:00
Valerio	1352f48e05	fix(cli): close --on-change command injection via sh -c (P0) (#20) * fix(cli): close --on-change command injection via sh -c (P0) The --on-change flag on `webclaw watch` (single-URL, line 1588) and `webclaw watch` multi-URL mode (line 1738) previously handed the entire user-supplied string to `tokio::process::Command::new("sh").arg("-c").arg(cmd)`. Any path that can influence that string (a malicious config file, an MCP client driven by an LLM with prompt-injection exposure, an untrusted environment variable substitution) gets arbitrary shell execution. The command is now tokenized with `shlex::split` (POSIX-ish quoting rules) and executed directly via `Command::new(prog).args(args)`. Metacharacters like `;`, `&&`, `|`, `$()`, `<(...)`, env expansion, and globbing no longer fire. An explicit opt-in escape hatch is available for users who genuinely need a shell pipeline: `WEBCLAW_ALLOW_SHELL=1` preserves the old `sh -c` path and logs a warning on every invocation so it can't slip in silently. Both call sites now route through a shared `spawn_on_change()` helper. Adds `shlex = "1"` to webclaw-cli dependencies. Version: 0.3.13 -> 0.3.14 CHANGELOG updated. Surfaced by the 2026-04-16 workspace audit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore(brand): fix clippy 1.95 unnecessary_sort_by errors Pre-existing sort_by calls in brand.rs became hard errors under clippy 1.95. Switch to sort_by_key with std::cmp::Reverse. Pure refactor: same ordering, no behavior change. Bundled here so CI goes green on the P0 command-injection fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 18:37:02 +02:00
Valerio	6316b1a6e7	fix: handle raw newlines in JSON-LD strings Sites like Bluesky emit JSON-LD with literal newline characters inside string values (technically invalid JSON). Add sanitize_json_newlines() fallback that escapes control characters inside quoted strings before retrying the parse. This recovers ProfilePage, Product, and other structured data that was previously silently dropped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 11:40:25 +02:00
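The fallback in 6316b1a6e7 escapes control characters that appear inside quoted JSON strings before the parse is retried. The sketch below shows one way such a pass could work; it reuses the name `sanitize_json_newlines` from the commit message, but the body is an assumption about the approach (a small state machine tracking whether the cursor is inside a string), not the project's implementation.

```rust
/// Escape raw control characters that occur inside JSON string literals.
/// Characters outside strings (including formatting newlines) are untouched.
/// Sketch only: the real function may handle more edge cases.
fn sanitize_json_newlines(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_string = false; // currently between an opening and closing quote
    let mut escaped = false; // previous char inside the string was a backslash
    for c in input.chars() {
        if !in_string {
            if c == '"' {
                in_string = true;
            }
            out.push(c);
            continue;
        }
        if escaped {
            // Whatever follows a backslash is copied verbatim.
            out.push(c);
            escaped = false;
            continue;
        }
        match c {
            '\\' => {
                escaped = true;
                out.push(c);
            }
            '"' => {
                in_string = false;
                out.push(c);
            }
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Any other control character becomes a \u00XX escape.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            _ => out.push(c),
        }
    }
    out
}
```

Running this only as a fallback after a first parse failure, as the commit describes, means well-formed JSON-LD never pays for the extra pass.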
Valerio	050b2ef463	feat: add allow_subdomains and allow_external_links to CrawlConfig Crawls are same-origin by default. Enable allow_subdomains to follow sibling/child subdomains (blog.example.com from example.com), or allow_external_links for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au). Includes 5 unit tests for root_domain(). Bump to 0.3.12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:33:06 +02:00
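The root-domain heuristic mentioned in 050b2ef463 can be sketched as follows. The suffix list here contains only the two examples the commit names (co.uk, com.au); the real `root_domain()` presumably covers more, and the `same_site` helper is a hypothetical name for how `allow_subdomains` might compare hosts.

```rust
/// Two-part public suffixes the heuristic knows about.
/// Only the two examples from the commit message; the real list is likely longer.
const TWO_PART_TLDS: &[&str] = &["co.uk", "com.au"];

/// Reduce a hostname to its registrable root, e.g.
/// "blog.example.com" -> "example.com", "shop.example.co.uk" -> "example.co.uk".
fn root_domain(host: &str) -> String {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    let labels: Vec<&str> = host.split('.').collect();
    if labels.len() <= 2 {
        return labels.join(".");
    }
    let last_two = labels[labels.len() - 2..].join(".");
    // Keep three labels when the last two form a known two-part suffix.
    let keep = if TWO_PART_TLDS.contains(&last_two.as_str()) { 3 } else { 2 };
    labels[labels.len() - keep..].join(".")
}

/// Hypothetical check for allow_subdomains: same root domain means same site.
fn same_site(a: &str, b: &str) -> bool {
    root_domain(a) == root_domain(b)
}
```

A hard-coded list is a heuristic rather than a full Public Suffix List lookup, so unusual suffixes would be treated as ordinary two-label domains.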
Valerio	a4c351d5ae	feat: add fallback sitemap paths for broader discovery Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms use non-standard paths that were previously missed. Paths found via robots.txt are deduplicated to avoid double-fetching. Bump to 0.3.11. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:22:57 +02:00
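The fallback discovery order from a4c351d5ae can be sketched as a candidate list with deduplication. The four paths come from the commit message; the function name `sitemap_candidates`, the choice to try robots.txt URLs first, and exact string comparison for deduplication are assumptions made for this illustration.

```rust
use std::collections::HashSet;

/// Standard path first, then the fallbacks named in the commit.
const WELL_KNOWN_PATHS: [&str; 4] = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
    "/sitemap/sitemap-index.xml",
];

/// Build the ordered list of sitemap URLs to try for `origin`.
/// URLs already found via robots.txt are not fetched a second time.
fn sitemap_candidates(origin: &str, from_robots: &[String]) -> Vec<String> {
    let origin = origin.trim_end_matches('/');
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    let well_known = WELL_KNOWN_PATHS.iter().map(|p| format!("{origin}{p}"));
    for url in from_robots.iter().cloned().chain(well_known) {
        // `insert` returns false when the URL was already queued.
        if seen.insert(url.clone()) {
            out.push(url);
        }
    }
    out
}
```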
Valerio	3cf9dbaf2a	chore: bump to 0.3.9, fix formatting from #14 Version bump for layout table, stack overflow, and noise filter fixes contributed by @devnen. Also fixes cargo fmt issues that caused CI lint failure on the merge commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 15:24:17 +02:00
Valerio	1d2018c98e	fix: MCP research saves to file, returns compact response Research results saved to ~/.webclaw/research/ (report.md + full.json). MCP returns file paths + findings instead of the full report, preventing "exceeds maximum allowed tokens" errors in Claude/Cursor. Same query returns cached result instantly without spending credits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 16:05:45 +02:00
Valerio	f7cc0cc5cf	feat: CLI --research flag + MCP cloud fallback + structured research output - --research "query": deep research via cloud API, saves JSON file with report + sources + findings, prints report to stdout - --deep: longer, more thorough research mode - MCP extract/summarize: cloud fallback when no local LLM available - MCP research: returns structured JSON instead of raw text - Bump to v0.3.7 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 14:04:04 +02:00
Valerio	344eea74d9	feat: structured data in markdown/LLM output + v0.3.6 __NEXT_DATA__, SvelteKit, and JSON-LD now appear as a ## Structured Data section in -f markdown and -f llm output. Works with --only-main-content and all extraction flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:16:56 +02:00
Valerio	8d29382b25	feat: extract __NEXT_DATA__ into structured_data Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">. Now extracted as structured JSON (pageProps) in the structured_data field. Tested on 45 sites — 13 return rich structured data including prices, product info, and page state not visible in the DOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:04:51 +02:00
Valerio	84b2e6092e	feat: SvelteKit data extraction + license change to AGPL-3.0 - Extract structured JSON from SvelteKit kit.start() data arrays - Convert JS object literals (unquoted keys) to valid JSON - Data appears in structured_data field (machine-readable) - License changed from MIT to AGPL-3.0 - Bump to v0.3.4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 20:37:56 +02:00
Valerio	aaf51eddef	feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles. This removes all 5 [patch.crates-io] entries that consumers previously needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145) are now built directly on wreq's Emulation API with correct TLS options, HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order. 84% pass rate across 1000 real sites. 384 unit tests green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:04:55 +02:00
Valerio	0d0da265ab	chore: bump to v0.3.2, update changelog Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:56:51 +02:00
Valerio	da1d76c97a	feat: add --cookie-file support for JSON cookie files - --cookie-file reads Chrome extension format ([{name, value, domain, ...}]) - Works with EditThisCookie, Cookie-Editor, and similar browser extensions - Merges with --cookie when both provided - MCP scrape tool now accepts cookies parameter - Closes #7 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:54:53 +02:00
github-actions[bot]	75e0a9cdef	chore: update webclaw-tls dependencies	2026-03-30 12:03:06 +00:00
github-actions[bot]	b784a3fa1b	chore: update webclaw-tls dependencies	2026-03-30 11:48:44 +00:00
Valerio	199dab6dfa	fix: adapt to webclaw-tls v0.1.1 HeaderMap API change Response.headers() now returns &http::HeaderMap instead of &HashMap<String, String>. Updated FetchResult, is_pdf_content_type, is_document_content_type, is_bot_protected, and all related tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:09:50 +02:00
github-actions[bot]	68b9406ff5	chore: update webclaw-tls dependencies	2026-03-30 09:53:03 +00:00
Valerio	f13cb83c73	feat: replace primp with webclaw-tls, bump to v0.3.0 Replace primp dependency with our own TLS fingerprinting stack (webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match. - Remove primp entirely (zero references remaining) - webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls - Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains) - Skip unknown certificate extensions (SCT tolerance) - 99% bypass rate on 102 sites (was ~85% with primp) - Fixes #5 (HTTPS broken — example.com and similar sites now work) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:40:10 +02:00
Valerio	78810793cf	chore: align Cargo.toml version with v0.2.3 tag	2026-03-27 20:41:02 +01:00
Valerio	341f4737e1	test: v0.2.2 pre-release check	2026-03-27 18:48:15 +01:00
Valerio	76cb6b6cd7	fix: add reqwest to patch list, sync with primp 1.2.0 primp 1.2.0 moved to reqwest 0.13 and now patches reqwest itself (primp-reqwest). Without this patch, cargo install gets vanilla reqwest 0.13 which is missing the HTTP/2 impersonation methods. Users should use: cargo install --locked --git ... Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 18:45:28 +01:00
Valerio	a6be233df9	feat: v0.2.1 — Docker image on GHCR, QuickJS data island extraction - Docker image auto-built on every release via CI - QuickJS sandbox executes inline <script> tags to extract JS-embedded content (window.__PRELOADED_STATE__, self.__next_f, etc.) - Bumped version to 0.2.1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 17:18:31 +01:00
Valerio	81e78963d0	feat: enable quickjs for JS data island extraction webclaw-core has QuickJS behind a feature flag for extracting data from inline <script> tags (window.__PRELOADED_STATE__, self.__next_f, etc). The server was using an old lockfile without the feature enabled. Updated deps to v0.2.0 and explicitly enabled quickjs. This improves extraction on SPAs like NYTimes, Nike, and Bloomberg where content is embedded in JS variable assignments rather than visible DOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 18:50:32 +01:00
Valerio	ea14848772	feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM Document extraction: - DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml) - XLSX/XLS: markdown tables with multi-sheet support (via calamine) - CSV: quoted field handling, markdown table output - All auto-detected by Content-Type header or URL extension New features: - -f html output format (sanitized HTML) - Multi-URL watch: --urls-file + --watch monitors all URLs in parallel - Batch + LLM: --extract-prompt/--extract-json works with multiple URLs - Mixed batch: HTML pages + DOCX + XLSX + CSV in one command Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:28:23 +01:00
Valerio	0e4128782a	fix: v0.1.7 - extraction options now work in batch mode (#3) --only-main-content, --include, and --exclude were ignored in batch mode because run_batch used default ExtractionOptions. Added fetch_and_extract_batch_with_options to pass CLI options through. Closes #3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 13:30:20 +01:00
Valerio	1b8dfb77a6	feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format) Watch mode: - --watch polls a URL at --watch-interval (default 5min) - Reports diffs to stdout when content changes - --on-change runs a command with diff JSON on stdin - Ctrl+C stops cleanly Webhooks: - --webhook POSTs JSON on crawl/batch complete and watch changes - Auto-detects Discord and Slack URLs, formats as embeds/blocks - Also available via WEBCLAW_WEBHOOK_URL env var - Non-blocking, errors logged to stderr Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 12:30:08 +01:00
Valerio	e5649e1824	feat: v0.1.5 — --output-dir saves each page to a separate file Adds --output-dir flag for CLI. Each extracted page gets its own file with filename derived from the URL path. Works with single URL, crawl, and batch modes. CSV input supports custom filenames (url,filename). Root URLs use hostname/index.ext to avoid collisions in batch mode. Subdirectories created automatically from URL path structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 11:02:25 +01:00
Valerio	32c035c543	feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction Embeds QuickJS (rquickjs) to execute inline <script> tags and extract data hidden in JavaScript variable assignments. Captures window.__* objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired), and self.__next_f (Next.js RSC flight data). Results: - NYTimes: 1,552 → 4,162 words (+168%) - Wired: 1,459 → 9,937 words (+580%) - Zero measurable performance overhead (<15ms per page) - Feature-gated: disable with --no-default-features for WASM Smart text filtering rejects CSS, base64, file paths, code strings. Only readable prose is appended under "## Additional Content". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 10:28:16 +01:00
Valerio	0c91c6d5a9	feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support Crawl: - Real-time progress on stderr as pages complete - --crawl-state saves progress on Ctrl+C, resumes from saved state - Visited set + remaining frontier persisted for accurate resume MCP server: - Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars - Falls back to proxies.txt in CWD (existing behavior) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 21:38:28 +01:00
Valerio	c90c0b6066	chore: bump to v0.1.2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 18:44:52 +01:00
Valerio	ea9c783bc5	fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation Critical: - MCP server identifies as "webclaw-mcp" instead of "rmcp" - Research tool poll loop capped at 200 iterations (~10 min) CLI: - Non-zero exit codes on errors - Text format strips markdown table syntax MCP server: - URL validation on all tools - 60s cloud API timeout, 30s local fetch timeout - Diff cloud fallback computes actual diff - Batch capped at 100 URLs, crawl at 500 pages - Graceful startup failure instead of panic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:25:05 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0, web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00

38 commits