webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-10 22:45:13 +02:00

Author	SHA1	Message	Date
devnen	ade2a5143c	feat(core): --mode sections for nav-URL discovery Section-URL ambiguity is recurring friction — callers have to guess whether to hit infobae.com root (LATAM frontpage) or /economia/ (AR- specific live FX dashboard), or decrypt.co root (ticker ribbon) vs /news/ (article list), or bbc.com/news/world vs /news/world/europe/. Each guess costs a round-trip. New `--mode sections` returns the discoverable section URLs parsed from the page's nav, in one round-trip. Subsumes issue #16 (non- English nav harder to LLM-parse — sections come back as data, not prose). Multi-signal heuristic on the existing link extraction: URL-pattern match (/<category>/ style short paths), repetition (section links appear in header + footer), DOM-position when available. Fallback when zero sections detected: emit top-N links with a "(none detected; first N shown)" note. Format: -f llm/text emits `Sections:` followed by `- [Label](url)` list. -f json emits `{"sections": [{"label": "...", "url": "..."}]}`. 13 new tests in webclaw-core (688 -> 701).	2026-05-23 23:14:40 +02:00
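As a rough illustration of the `-f llm/text` output shape described in ade2a5143c, the sketch below renders a section list as `Sections:` followed by `- [Label](url)` lines. The `Section` struct and `render_sections` function are hypothetical names chosen for this example, not identifiers confirmed by the commit.

```rust
/// Hypothetical section record; the field names follow the JSON shape
/// quoted in the commit message ({"label": ..., "url": ...}).
pub struct Section {
    pub label: String,
    pub url: String,
}

/// Render sections as a `Sections:` header plus one Markdown link per line.
pub fn render_sections(sections: &[Section]) -> String {
    let mut out = String::from("Sections:\n");
    for s in sections {
        out.push_str(&format!("- [{}]({})\n", s.label, s.url));
    }
    out
}
```

The zero-section fallback mentioned in the commit is left out here because the message does not say where its note is placed.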
devnen	76cd515a3e	feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites On sites like Hollywood Reporter where the extracted body is < 500 words because the page is JS-walled (chrome rendering is needed), webclaw now emits a one-line stderr hint: # hint: extracted body is N words (thin); the page may be JS-walled. Try --browser chrome for JS-rendered content. Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs) mirrors the M2 hub-detector structure. Threshold: 500 words. Exemption list for utility domains (example.com, httpbin.org, etc) where thinness is by design. The originally proposed --retry-thin flag was dropped after phase A determined webclaw has no headless-JS backend to retry to (--browser only affects User-Agent impersonation, not actual rendering). The hint-only design lets the caller decide: re-run with --browser chrome manually, or switch to a different fetcher entirely. Hint suppressed in --mode summary / --mode toc (link/outline focused); M3 fast-fails skip the formatter entirely so no hint. Stdout invariance: tested byte-identical on all p01-p15 default probes. M10 only modifies stderr. 10 new tests (workspace 678 -> 688).	2026-05-23 22:18:12 +02:00
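The thin-body rule in 76cd515a3e (under 500 words, with exempt utility domains) might look roughly like the sketch below. The function name, constant names, and the two-entry exemption list are assumptions for illustration; the commit only names example.com and httpbin.org explicitly.

```rust
/// Threshold quoted in the commit message.
const THIN_WORD_THRESHOLD: usize = 500;

/// Illustrative exemption list; only these two hosts are named in the commit.
const EXEMPT_HOSTS: &[&str] = &["example.com", "httpbin.org"];

/// Return the stderr hint for a thin body, or None when the page is
/// long enough or the host is exempt. Hypothetical helper.
pub fn thin_body_hint(host: &str, word_count: usize) -> Option<String> {
    let host = host.to_ascii_lowercase();
    if word_count >= THIN_WORD_THRESHOLD || EXEMPT_HOSTS.contains(&host.as_str()) {
        return None;
    }
    Some(format!(
        "# hint: extracted body is {word_count} words (thin); the page may be JS-walled. Try --browser chrome for JS-rendered content."
    ))
}
```

Returning an `Option<String>` keeps stdout untouched: the caller decides whether to write the hint to stderr, which matches the stdout-invariance goal stated in the commit.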
devnen	dfcd51d9e0	feat(core): HTTP status header line in -f llm/text/json output Webclaw previously emitted URL, Title, Description, and Word count in the -f llm header but no HTTP status. On a 404 response, the caller had no signal apart from inspecting the body (e.g. dailysabah.com/ business/economy returns a 404 page; webclaw was extracting '13 words' of the error page without flagging the 404 status). New behavior: every -f llm/text/json output includes a 'Status: <code>' header line (after URL: per phase A's placement). Emitted on all responses including 200 for consistency — callers can't otherwise distinguish 'webclaw saw 200' from 'webclaw missed status info'. For -f json: top-level "status": <code> field added. Modes --mode summary and --mode toc are exempt: the status line would clutter the link-list and outline outputs. M3 fast-fails (known-bad-sites) also skip the status line because they exit before the formatter is reached. 7 new tests in webclaw-core (workspace total 671 -> 678).	2026-05-23 21:29:26 +02:00
devnen	66974366d7	feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld JSON-LD is consistently the cleanest source on major outlets (Reuters, BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured Data block at the bottom of -f llm output; this iter teaches it to parse the JSON-LD by schema and surface it usefully. New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review, WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is auto-lifted (Reuters CollectionPage shape). Two new CLI flags: --prefer-structured: surfaces the schema-aware block at the TOP of the output, before prose. For -f llm emits a Markdown summary block; for -f json emits a {structured, extracted} envelope. Bypasses the default DROP list for WebPage/chrome types when explicitly requested. --articles-from-jsonld: when the page contains ItemList or LiveBlogPosting, output ONLY a JSON array of articles ({position, title, url, published}). When no such schema is present, emit a stderr hint and fall through to default extraction (no error). Default behavior (neither flag set) byte-identical to iter-3 on all default-flag probes (regression sentinel passed): Cyrillic p14 still 7735 B, M1 caps p18/p19/p20 deterministic, M2 hub p40/p41 byte-identical, M3 registry p44/p45/p46 still fast-fail with exit 67. 14 new tests in webclaw-core covering schema-variant parsing, parse error handling, fall-through behavior, flag combinations, and the default-byte-identical sentinel. Workspace tests 657 -> 671.	2026-05-23 20:38:59 +02:00
devnen	e28b22adf7	feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls Sites known to require CAPTCHA-solving (Cloudflare interstitials) or browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot be reached by webclaw's chrome impersonation; they return interstitial stubs ('Just a moment...', 'Please enable JS and disable any ad blocker') with 0 useful content. Currently each call wastes 5-10s on the timeout before the caller sees the failure. New registry under crates/webclaw-fetch/src/known_bad_sites.rs lists known bad hosts with a category (CloudflareInterstitial / AdblockWall) and suggested substitute domains. Host matching: lowercase + strip leading 'www.' + exact-match against registered host. On registry hit, webclaw writes 'error: <host> is <category>-walled; suggested substitute: <alt1>, <alt2>' to stderr and exits with code 67 (EX_NOHOST), BEFORE making any network call. wall_ms drops from ~5000 to <50 for listed hosts. Initial entries: ambito.com (Cloudflare; substitutes cronista.com, iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr, lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are subscription paywalls with different bypass semantics; deferred to M11. 10 new tests in webclaw-fetch covering host normalization, www stripping, path-under-host matching, case insensitivity, unknown-domain pass-through, and the formatted error message (9 unit + 1 fetch-layer integration). Workspace test total 647 -> 657.	2026-05-23 19:42:15 +02:00
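The host-matching rule in e28b22adf7 (lowercase, strip a leading `www.`, exact match) can be sketched as follows. The enum, constant, and function names are hypothetical; only the two initial entries and their substitutes come from the commit message.

```rust
/// Hypothetical category enum; variant names mirror the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WallCategory {
    CloudflareInterstitial,
    AdblockWall,
}

/// Illustrative registry rows: (host, category, suggested substitutes).
const REGISTRY: &[(&str, WallCategory, &[&str])] = &[
    ("ambito.com", WallCategory::CloudflareInterstitial, &["cronista.com", "iprofesional.com"]),
    ("liberation.fr", WallCategory::AdblockWall, &["lemonde.fr", "lepoint.fr"]),
];

/// Lowercase the host and strip one leading "www." label.
pub fn normalize_host(host: &str) -> String {
    let lower = host.to_ascii_lowercase();
    match lower.strip_prefix("www.") {
        Some(rest) => rest.to_string(),
        None => lower,
    }
}

/// Exact-match lookup against the registry after normalization.
pub fn lookup(host: &str) -> Option<(WallCategory, &'static [&'static str])> {
    let key = normalize_host(host);
    REGISTRY
        .iter()
        .find(|(h, _, _)| *h == key)
        .map(|(_, cat, subs)| (*cat, *subs))
}
```

Because the lookup is a pure in-memory match, a caller can run it before any network I/O, which is what makes the fast-fail path cheap.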
devnen	31a8f6150f	feat(core): JS-hub page detector + --prefer-articles flag Detects ESPN-style hub pages (espn.com/nba/, /nfl/, /mlb/, /nhl/, /soccer/) where the rendered markup has nav-only content with no article bodies — chrome retry doesn't help because the data genuinely isn't in the markup. Heuristic: word_count < 500 AND link_count >= 5 against the extracted output. --prefer-articles: when set, a hub-classified page returns the extracted link list (reusing the M1 --mode summary machinery) instead of the sparse body. On non-hub pages, behavior is unchanged. stderr hint: always emitted on hub detection so the caller knows to drill /story/_/id/<id>/ URLs from a citation list. False-positive resistance verified: BBC News /world (link-heavy aggregator, 1500+ words body) and n1info.rs (widget-heavy but content-rich) both classify as non-hub and emit full extraction. 9 new tests in webclaw-core (317 -> 326).	2026-05-23 18:55:17 +02:00
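The hub heuristic in 31a8f6150f is a two-condition test, sketched below under the assumption that words are counted by whitespace splitting. The constant and function names are illustrative and not taken from the codebase.

```rust
/// Thresholds quoted in the commit message.
const HUB_MAX_WORDS: usize = 500;
const HUB_MIN_LINKS: usize = 5;

/// Hypothetical helper: count whitespace-separated words in extracted text.
pub fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Classify a page as a hub when the body is thin AND links are plentiful.
pub fn is_hub(extracted_text: &str, link_count: usize) -> bool {
    word_count(extracted_text) < HUB_MAX_WORDS && link_count >= HUB_MIN_LINKS
}
```

Requiring both conditions is what lets a link-heavy aggregator with a long body pass as non-hub.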
devnen	339f41bb7c	feat(cli): add --max-output-bytes and --mode summary,toc for output-size control Three additive CLI flags addressing the 50KB persisted-output cap that trips Claude Code's per-tool-result harness on aggregator front pages (apnews.com, cnbc.com/markets/, b92.net all >50KB by default): --max-output-bytes N: truncates final output at N bytes with a clear '[truncated: M more bytes ...]' footer. N=0 means unlimited (default). UTF-8 codepoint-boundary safe; also wraps JSON output so truncated output stays parseable. --mode summary: returns only the extracted link list (titles + URLs), no body text. For aggregator front pages where the LLM is going to drill the individual articles next anyway. --mode toc: returns H1/H2 outline + first paragraph after each H2. For long single-article pages. New flags are orthogonal to -f (json/llm/text). 9 new unit tests in webclaw-core, total goes 308 -> 317 passing. Smoke-tested on apnews.com (51713 -> 27404 summary -> 6269 toc -> 8193 capped), pitchfork.com (42049 -> 379 summary), cnbc.com (56682 -> 16385 capped).	2026-05-23 18:17:42 +02:00
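A minimal sketch of the codepoint-safe truncation described for `--max-output-bytes` in 339f41bb7c. The function name is hypothetical, the footer wording is shortened for illustration, and this version applies the cap to the payload only (the footer is appended on top), which may differ from the real accounting.

```rust
/// Truncate `s` to at most `max_bytes` bytes of payload, backing up to the
/// nearest UTF-8 char boundary. `max_bytes == 0` means unlimited.
pub fn truncate_output(s: &str, max_bytes: usize) -> String {
    if max_bytes == 0 || s.len() <= max_bytes {
        return s.to_string();
    }
    // Walk backwards until the cut no longer lands inside a multibyte char.
    // Index 0 is always a boundary, so the loop terminates.
    let mut cut = max_bytes;
    while !s.is_char_boundary(cut) {
        cut -= 1;
    }
    let dropped = s.len() - cut;
    format!("{}\n[truncated: {} more bytes]", &s[..cut], dropped)
}
```

Slicing at an arbitrary byte index would panic on multibyte input, so the boundary walk is the part that matters here.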
Valerio	be8bcfebd9	fix: harden resource limits, path safety, and WASM build (#46) Security audit follow-up across the workspace: - webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a cfg(not(wasm32)) target dependency and the extraction entry point uses a direct call on wasm instead of spawning a thread, so it builds and runs on wasm32 with or without default features. - webclaw-core: bound the structured-data scrubber recursion (depth cap) so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the stack. - webclaw-fetch: stream the response body with a running ceiling so a small highly compressed payload cannot inflate to gigabytes in memory; redact user:pass@ from proxy URLs before they reach error strings. - webclaw-cli: contain output filenames inside the chosen directory (reject .. / absolute, drop traversal path segments), run --webhook URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s, and make research slug truncation char-safe. - webclaw-mcp: char-safe slug truncation (no multibyte slice panic). - setup.sh / deploy/hetzner.sh: replace eval on read input with printf -v, and mask auth key / API token in console output. - CI: enforce the wasm32 build invariant for webclaw-core. Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.	2026-05-19 17:03:52 +02:00
Valerio	3fabdc1d02	fix: clean llm output noise Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions. Includes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default. Thanks to @devnen for the report and patch work.	2026-05-18 18:39:33 +02:00
Valerio	a611ae26f3	fix(security): harden local fetch surfaces	2026-05-12 12:00:25 +02:00
Valerio	a3aa4bce6f	fix: support LLM provider compatibility options Closes #36	2026-05-06 11:36:53 +02:00
Valerio	98a177dec4	feat(cli): expose safari-ios browser profile + bump to 0.5.5	2026-04-23 13:32:55 +02:00
Valerio	0daa2fec1a	feat(cli+mcp): vertical extractor support (28 extractors discoverable + callable) Wires the vertical extractor catalog into both the CLI and the MCP server so users don't have to hit the HTTP API to invoke them. Same semantics as `/v1/scrape/{vertical}` + `/v1/extractors`. CLI (webclaw-cli): - New subcommand `webclaw extractors` lists all 28 extractors with name, label, and sample URL. `--json` flag emits the full catalog as machine-readable JSON. - New subcommand `webclaw vertical <name> <url>` runs a specific extractor and prints typed JSON. Pretty-printed by default; `--raw` for single-line. Exits 1 with a clear "URL does not match" error on mismatch. - FetchClient built with Firefox profile + cloud fallback attached when WEBCLAW_API_KEY is set, so antibot-gated verticals escalate. MCP (webclaw-mcp): - New tool `list_extractors` (no args) returns the catalog as pretty-printed JSON for in-session discovery. - New tool `vertical_scrape` takes `{name, url}` and returns typed JSON. Reuses the long-lived self.fetch_client. - Tool count goes from 10 to 12. Server-info instruction string updated accordingly. Tests: 215 passing, clippy clean. Manual surface-tested end-to-end: CLI prints real Reddit/github/pypi data; MCP JSON-RPC session returns 28-entry catalog + typed responses for pypi/requests + rust-lang/rust in 200-400ms. Version bumped to 0.5.2 (minor for API additions, backwards compatible).	2026-04-22 21:41:15 +02:00
Valerio	0ab891bd6b	refactor(cloud): consolidate CloudClient + smart_fetch into webclaw-fetch The local-first / cloud-fallback flow was duplicated in two places: - webclaw-mcp/src/cloud.rs (302 lines, canonical) - webclaw-cli/src/cloud.rs (80 lines, minimal subset kept to avoid pulling rmcp as a dep) Move to the shared crate where all vertical extractors and the new webclaw-server can also reach it. ## New module: webclaw-fetch/src/cloud.rs Single canonical home. Consolidates both previous versions and promotes the error type from stringy to typed: - `CloudError` enum with dedicated variants for the four HTTP outcomes callers act on differently — 401 (key rejected), 402 (insufficient plan), 429 (rate limited), plus ServerError / Network / ParseFailed. Each variant's Display message ends with an actionable URL (signup / pricing / dashboard) so API consumers can surface it verbatim. - `From<CloudError> for String` bridge so the dozen existing `.await?` call sites in MCP / CLI that expected `Result<_, String>` keep compiling. We can migrate them to the typed error per-site later without a churn commit. - `CloudClient::new(Option<&str>)` matches the CLI's `--api-key` flag pattern (explicit key wins, env fallback, None when empty). `::from_env()` kept for MCP-style call sites. - `with_key_and_base` for staging / integration tests. - `scrape / post / get / fetch_html` — `fetch_html` is new, a convenience that calls /v1/scrape with formats=["html"] and returns the raw HTML string so vertical extractors can plug antibot-bypassed HTML straight into their parsers. - `is_bot_protected` + `needs_js_rendering` detectors moved over verbatim. Detection patterns are public (CF / DataDome / AWS WAF challenge-page signatures) — no moat leak. - `smart_fetch` kept on the original `Result<_, String>` signature so MCP's six call sites compile unchanged. 
- `smart_fetch_html` is new: the local-first-then-cloud flow for the vertical-extractor pattern, returning the typed `CloudError` so extractors can emit precise upgrade-path messages. ## Cleanup - Deleted webclaw-mcp/src/cloud.rs — all imports now resolve to `webclaw_fetch::cloud::*`. Dropped reqwest as a direct dep of webclaw-mcp (it only used it for the old cloud client). - Deleted webclaw-cli/src/cloud.rs. CLI keeps reqwest for its webhook / on-change / research HTTP calls. - webclaw-fetch now has reqwest as a direct dep. It was already transitively pulled in by webclaw-llm; this just makes the dependency relationship explicit at the call site. ## Tests 16 new unit tests cover: - CloudError status mapping (401/402/429/5xx) - NotConfigured error includes signup URL - CloudClient::new explicit-key-wins-over-env + empty-string = None - base_url strips trailing slash - Detector matrix (CF challenge / Turnstile / real content with embedded Turnstile / SPA skeleton / real article with script tags) - truncate respects char boundaries (don't slice inside UTF-8) Full workspace test suite still passes (~500 tests). fmt + clippy clean. No behavior change for existing MCP / CLI call sites.	2026-04-22 16:05:44 +02:00
Valerio	d91ad9c1f4	feat(cli): add webclaw bench <url> subcommand (closes #26) Per-URL extraction micro-benchmark. Fetches a URL once, runs the same pipeline as --format llm, prints a small ASCII table comparing raw HTML vs. llm output on tokens, bytes, and extraction time. webclaw bench https://stripe.com # ASCII table webclaw bench https://stripe.com --json # one-line JSON webclaw bench https://stripe.com --facts FILE # adds fidelity row The --facts file uses the same schema as benchmarks/facts.json (curated visible-fact list per URL). URLs not in the file produce no fidelity row, so an uncurated site doesn't show 0/0. v1 uses an approximate tokenizer (chars/4 Latin, chars/2 when CJK dominates). Off by ~10% vs cl100k_base but the signal — 'is the LLM output 90% smaller than the raw HTML' — is order-of-magnitude, not precise accounting. Output is labeled '~ tokens' so nobody mistakes it for a real BPE count. Swapping in tiktoken-rs later is a one function change; left out of v1 to avoid the 2 MB BPE-data binary bloat for a feature most users will run a handful of times. Implemented as a real clap subcommand (clap::Subcommand) rather than yet another flag, with the existing flag-based flow falling through when no subcommand is given. Existing 'webclaw <url> --format ...' invocations work exactly as before. Lays the groundwork for future subcommands without disrupting the legacy flat-flag UX. 12 new unit tests cover the tokenizer, formatters, host extraction, and fact-matching. Verified end-to-end on example.com and tavily.com (5/5 facts preserved at 93% token reduction).	2026-04-22 12:25:29 +02:00
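The approximate tokenizer in d91ad9c1f4 (chars/4 for Latin text, chars/2 when CJK dominates) could be sketched like this. The "more than half of the characters" test for dominance, the Unicode ranges chosen, and the round-up behaviour are all assumptions; the commit does not specify them.

```rust
/// Rough CJK test covering a few common blocks (assumed ranges).
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF      // Hiragana + Katakana
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7AF // Hangul syllables
    )
}

/// Estimate token count: chars/2 when CJK chars are the majority,
/// chars/4 otherwise, rounded up. Hypothetical helper.
pub fn approx_tokens(text: &str) -> usize {
    let total = text.chars().count();
    if total == 0 {
        return 0;
    }
    let cjk = text.chars().filter(|c| is_cjk(*c)).count();
    let divisor = if cjk * 2 > total { 2 } else { 4 };
    (total + divisor - 1) / divisor
}
```

Counting `chars()` rather than bytes keeps multibyte scripts from being overcounted before the divisor is applied.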
Valerio	d69c50a31d	feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22) * feat(fetch,llm): DoS hardening via response caps + glob validation (P2) Response body caps: - webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks Content-Length up front (before the allocation) and the actual .bytes() length after (belt-and-braces against lying upstreams). Previously the HTML -> markdown conversion downstream could allocate multiple String copies per page; a 100 MB page would OOM the process. - webclaw-llm providers (anthropic/openai/ollama) share a new response_json_capped helper with a 5 MB cap. Protects against a malicious or runaway provider response exhausting memory. Crawler frontier cap: after each BFS depth level the frontier is truncated to max(max_pages * 10, 100) entries, keeping the most recently discovered links. Dense pages (tag clouds, search results) used to push the frontier into the tens of thousands even after max_pages halted new fetches. Glob pattern validation: user-supplied include_patterns / exclude_patterns are rejected at Crawler::new if they contain more than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher degrades exponentially on deeply-nested `**` against long paths. Cleanup: - Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs; no warnings surfaced, the suppression was obsolete. - core/.gitignore: replaced overbroad *.json with specific local-artifact patterns (previous rule would have swallowed package.json, components.json, .smithery/*.json). Tests: +4 validate_glob tests. Full workspace test: 283 passed (webclaw-core + webclaw-fetch + webclaw-llm). Version: 0.3.15 -> 0.3.16 CHANGELOG updated. 
Refs: docs/AUDIT-2026-04-16.md (P2 section) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: gitignore CLI research dumps, drop accidentally-tracked file research-*.json output from `webclaw ... --research ...` got silently swept into git by the relaxed *.json gitignore in the preceding commit. The old blanket *.json rule was hiding both this legitimate scratch file AND packages/create-webclaw/server.json (MCP registry config that we DO want tracked). Removes the research dump from git and adds a narrower research-*.json ignore pattern so future CLI output doesn't get re-tracked by accident. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 19:44:08 +02:00
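The two limits introduced in d69c50a31d, the frontier cap and the glob validation, can be sketched as below. This assumes the frontier is a `Vec` with the newest links at the back and that the limited wildcard is `**`. `cap_frontier` is a hypothetical name, and `validate_glob` borrows its name from the commit's test note, but its signature here is a guess.

```rust
/// Keep only the most recently discovered links, capped at
/// max(max_pages * 10, 100) entries as stated in the commit.
pub fn cap_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = max_pages.saturating_mul(10).max(100);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        // Oldest entries sit at the front in this sketch, so drop those.
        frontier.drain(..excess);
    }
}

/// Reject patterns longer than 1024 chars or with more than 4 `**` wildcards.
pub fn validate_glob(pattern: &str) -> Result<(), String> {
    if pattern.chars().count() > 1024 {
        return Err("glob pattern exceeds 1024 chars".to_string());
    }
    // `matches` counts non-overlapping occurrences of "**".
    if pattern.matches("**").count() > 4 {
        return Err("glob pattern has more than 4 `**` wildcards".to_string());
    }
    Ok(())
}
```

Validating at construction time means a hostile pattern is rejected before the backtracking matcher ever sees a long path.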
Valerio	1352f48e05	fix(cli): close --on-change command injection via sh -c (P0) (#20) * fix(cli): close --on-change command injection via sh -c (P0) The --on-change flag on `webclaw watch` (single-URL, line 1588) and `webclaw watch` multi-URL mode (line 1738) previously handed the entire user-supplied string to `tokio::process::Command::new("sh").arg("-c").arg(cmd)`. Any path that can influence that string — a malicious config file, an MCP client driven by an LLM with prompt-injection exposure, an untrusted environment variable substitution — gets arbitrary shell execution. The command is now tokenized with `shlex::split` (POSIX-ish quoting rules) and executed directly via `Command::new(prog).args(args)`. Metacharacters like `;`, `&&`, `|`, `$()`, `<(...)`, env expansion, and globbing no longer fire. An explicit opt-in escape hatch is available for users who genuinely need a shell pipeline: `WEBCLAW_ALLOW_SHELL=1` preserves the old `sh -c` path and logs a warning on every invocation so it can't slip in silently. Both call sites now route through a shared `spawn_on_change()` helper. Adds `shlex = "1"` to webclaw-cli dependencies. Version: 0.3.13 -> 0.3.14 CHANGELOG updated. Surfaced by the 2026-04-16 workspace audit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore(brand): fix clippy 1.95 unnecessary_sort_by errors Pre-existing sort_by calls in brand.rs became hard errors under clippy 1.95. Switch to sort_by_key with std::cmp::Reverse. Pure refactor — same ordering, no behavior change. Bundled here so CI goes green on the P0 command-injection fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 18:37:02 +02:00
Valerio	050b2ef463	feat: add allow_subdomains and allow_external_links to CrawlConfig Crawls are same-origin by default. Enable allow_subdomains to follow sibling/child subdomains (blog.example.com from example.com), or allow_external_links for full cross-origin crawling. Root domain extraction uses a heuristic that handles two-part TLDs (co.uk, com.au). Includes 5 unit tests for root_domain(). Bump to 0.3.12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:33:06 +02:00
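One way the `root_domain()` heuristic from 050b2ef463 might handle two-part TLDs is sketched here. The suffix list contains only the two examples named in the commit, and the function body is an assumption about the approach, not the actual implementation.

```rust
/// Illustrative two-part suffixes; only these two appear in the commit message.
const TWO_PART_TLDS: &[&str] = &["co.uk", "com.au"];

/// Reduce a host to its registrable root: the last two labels, or the
/// last three when the final two form a known two-part TLD.
pub fn root_domain(host: &str) -> String {
    let host = host.to_ascii_lowercase();
    let labels: Vec<&str> = host.split('.').collect();
    if labels.len() <= 2 {
        return labels.join(".");
    }
    let last_two = labels[labels.len() - 2..].join(".");
    let keep = if TWO_PART_TLDS.contains(&last_two.as_str()) { 3 } else { 2 };
    labels[labels.len() - keep..].join(".")
}
```

With a helper like this, an `allow_subdomains` check reduces to comparing `root_domain(a)` with `root_domain(b)`; a hard-coded suffix list is a heuristic and will miss suffixes it does not know.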
Valerio	f7cc0cc5cf	feat: CLI --research flag + MCP cloud fallback + structured research output - --research "query": deep research via cloud API, saves JSON file with report + sources + findings, prints report to stdout - --deep: longer, more thorough research mode - MCP extract/summarize: cloud fallback when no local LLM available - MCP research: returns structured JSON instead of raw text - Bump to v0.3.7 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 14:04:04 +02:00
Valerio	344eea74d9	feat: structured data in markdown/LLM output + v0.3.6 __NEXT_DATA__, SvelteKit, and JSON-LD now appear as a ## Structured Data section in -f markdown and -f llm output. Works with --only-main-content and all extraction flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:16:56 +02:00
Valerio	da1d76c97a	feat: add --cookie-file support for JSON cookie files - --cookie-file reads Chrome extension format ([{name, value, domain, ...}]) - Works with EditThisCookie, Cookie-Editor, and similar browser extensions - Merges with --cookie when both provided - MCP scrape tool now accepts cookies parameter - Closes #7 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:54:53 +02:00
Valerio	f13cb83c73	feat: replace primp with webclaw-tls, bump to v0.3.0 Replace primp dependency with our own TLS fingerprinting stack (webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match. - Remove primp entirely (zero references remaining) - webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls - Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains) - Skip unknown certificate extensions (SCT tolerance) - 99% bypass rate on 102 sites (was ~85% with primp) - Fixes #5 (HTTPS broken — example.com and similar sites now work) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:40:10 +02:00
Valerio	ea14848772	feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM Document extraction: - DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml) - XLSX/XLS: markdown tables with multi-sheet support (via calamine) - CSV: quoted field handling, markdown table output - All auto-detected by Content-Type header or URL extension New features: - -f html output format (sanitized HTML) - Multi-URL watch: --urls-file + --watch monitors all URLs in parallel - Batch + LLM: --extract-prompt/--extract-json works with multiple URLs - Mixed batch: HTML pages + DOCX + XLSX + CSV in one command Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:28:23 +01:00
Valerio	0e4128782a	fix: v0.1.7 — extraction options now work in batch mode (#3) --only-main-content, --include, and --exclude were ignored in batch mode because run_batch used default ExtractionOptions. Added fetch_and_extract_batch_with_options to pass CLI options through. Closes #3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 13:30:20 +01:00
Valerio	1b8dfb77a6	feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format) Watch mode: - --watch polls a URL at --watch-interval (default 5min) - Reports diffs to stdout when content changes - --on-change runs a command with diff JSON on stdin - Ctrl+C stops cleanly Webhooks: - --webhook POSTs JSON on crawl/batch complete and watch changes - Auto-detects Discord and Slack URLs, formats as embeds/blocks - Also available via WEBCLAW_WEBHOOK_URL env var - Non-blocking, errors logged to stderr Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 12:30:08 +01:00
Valerio	e5649e1824	feat: v0.1.5 — --output-dir saves each page to a separate file Adds --output-dir flag for CLI. Each extracted page gets its own file with filename derived from the URL path. Works with single URL, crawl, and batch modes. CSV input supports custom filenames (url,filename). Root URLs use hostname/index.ext to avoid collisions in batch mode. Subdirectories created automatically from URL path structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 11:02:25 +01:00
Valerio	0c91c6d5a9	feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support Crawl: - Real-time progress on stderr as pages complete - --crawl-state saves progress on Ctrl+C, resumes from saved state - Visited set + remaining frontier persisted for accurate resume MCP server: - Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars - Falls back to proxies.txt in CWD (existing behavior) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 21:38:28 +01:00
Valerio	b92c0ed186	style: fix cargo fmt formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:27:15 +01:00
Valerio	ea9c783bc5	fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation Critical: - MCP server identifies as "webclaw-mcp" instead of "rmcp" - Research tool poll loop capped at 200 iterations (~10 min) CLI: - Non-zero exit codes on errors - Text format strips markdown table syntax MCP server: - URL validation on all tools - 60s cloud API timeout, 30s local fetch timeout - Diff cloud fallback computes actual diff - Batch capped at 100 URLs, crawl at 500 pages - Graceful startup failure instead of panic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:25:05 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed | https://webclaw.io	2026-03-23 18:31:11 +01:00

30 commits