webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-07-17 06:21:02 +02:00

Author	SHA1	Message	Date
Valerio	0463b5e263	style: cargo fmt Some checks are pending CI / Test (push) Waiting to run CI / Lint (push) Waiting to run CI / Docs (push) Waiting to run Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 12:03:22 +02:00
Valerio	7f0420bbf0	fix(core): UTF-8 char boundary panic in find_content_position (#16 ) (#24 ) `search_from = abs_pos + 1` landed mid-char when a rejected match started on a multi-byte UTF-8 character, panicking on the next `markdown[search_from..]` slice. Advance by `needle.len()` instead — always a valid char boundary, and skips the whole rejected match instead of re-scanning inside it. Repro: webclaw https://bruler.ru/about_brand -f json Before: panic "byte index 782 is not a char boundary; it is inside 'Ч'" After: extracts 2.3KB of clean Cyrillic markdown with 7 sections Two regression tests cover multi-byte rejected matches and all-rejected cycles in Cyrillic text. Closes #16 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 12:02:52 +02:00
Valerio	1352f48e05	fix(cli): close --on-change command injection via sh -c (P0) (#20 ) * fix(cli): close --on-change command injection via sh -c (P0) The --on-change flag on `webclaw watch` (single-URL, line 1588) and `webclaw watch` multi-URL mode (line 1738) previously handed the entire user-supplied string to `tokio::process::Command::new("sh").arg("-c").arg(cmd)`. Any path that can influence that string — a malicious config file, an MCP client driven by an LLM with prompt-injection exposure, an untrusted environment variable substitution — gets arbitrary shell execution. The command is now tokenized with `shlex::split` (POSIX-ish quoting rules) and executed directly via `Command::new(prog).args(args)`. Metacharacters like `;`, `&&`, `\|`, `$()`, `<(...)`, env expansion, and globbing no longer fire. An explicit opt-in escape hatch is available for users who genuinely need a shell pipeline: `WEBCLAW_ALLOW_SHELL=1` preserves the old `sh -c` path and logs a warning on every invocation so it can't slip in silently. Both call sites now route through a shared `spawn_on_change()` helper. Adds `shlex = "1"` to webclaw-cli dependencies. Version: 0.3.13 -> 0.3.14 CHANGELOG updated. Surfaced by the 2026-04-16 workspace audit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore(brand): fix clippy 1.95 unnecessary_sort_by errors Pre-existing sort_by calls in brand.rs became hard errors under clippy 1.95. Switch to sort_by_key with std::cmp::Reverse. Pure refactor — same ordering, no behavior change. Bundled here so CI goes green on the P0 command-injection fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 18:37:02 +02:00
Valerio	6316b1a6e7	fix: handle raw newlines in JSON-LD strings Some checks are pending CI / Test (push) Waiting to run CI / Lint (push) Waiting to run CI / Docs (push) Waiting to run Sites like Bluesky emit JSON-LD with literal newline characters inside string values (technically invalid JSON). Add sanitize_json_newlines() fallback that escapes control characters inside quoted strings before retrying the parse. This recovers ProfilePage, Product, and other structured data that was previously silently dropped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 11:40:25 +02:00
Valerio	5ea646a332	fix: resolve clippy warnings from #14 (collapsible_if, manual_inspect) CI runs Rust 1.94 which flags these. Collapsed nested if-let in cell_has_block_content() and replaced .map()+return with .inspect() in table_to_md(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 15:28:59 +02:00
Valerio	3cf9dbaf2a	chore: bump to 0.3.9, fix formatting from #14 Version bump for layout table, stack overflow, and noise filter fixes contributed by @devnen. Also fixes cargo fmt issues that caused CI lint failure on the merge commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 15:24:17 +02:00
devnen	70c67f2ed6	fix: prevent noise filter from swallowing content in malformed HTML Two related fixes for content being stripped by the noise filter: 1. Remove <form> from unconditional noise tags. ASP.NET and similar frameworks wrap entire pages in a <form> tag — these are not input forms. Forms with >500 chars of text are now treated as content wrappers, not noise. 2. Add safety valve for class/ID noise matching. When malformed HTML leaves a noise container unclosed (e.g., <div class="header"> missing its </div>), the HTML5 parser makes all subsequent siblings into children of that container. A header/nav/footer with >5000 chars of text is almost certainly a broken wrapper absorbing real content — exempt it from noise filtering.	2026-04-04 01:38:42 +02:00
devnen	74bac87435	fix: prevent stack overflow on deeply nested HTML pages Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing the default 1 MB main-thread stack on Windows during recursive markdown conversion. Two-layer fix: 1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit 2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so html5ever parsing and extraction both have room on deeply nested pages Tested with Express.co.uk live blog (previously crashed, now extracts 2000+ lines of clean markdown) and drudgereport.com (still works correctly).	2026-04-03 23:45:19 +02:00
devnen	95a6681b02	fix: detect layout tables and render as sections instead of markdown tables Sites like Drudge Report use <table> for page layout, not data. Each cell contains extensive block-level content (divs, hrs, paragraphs, links). Previously, table_to_md() called inline_text() on every cell, collapsing all whitespace and flattening block elements into a single unreadable line. Changes: - Add cell_has_block_content() heuristic: scans for block-level descendants (p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables - Layout tables render each cell as a standalone section separated by blank lines, using children_to_md() to preserve block structure - Data tables (no block elements in cells) keep existing markdown table format - Bold/italic tags containing block elements are treated as containers instead of wrapping in //* (fixes Drudge's <b><font>...</font></b> column wrappers that contain the entire column content) - Add tests for layout tables with paragraphs and with links	2026-04-03 22:24:35 +02:00
Valerio	344eea74d9	feat: structured data in markdown/LLM output + v0.3.6 __NEXT_DATA__, SvelteKit, and JSON-LD now appear as a ## Structured Data section in -f markdown and -f llm output. Works with --only-main-content and all extraction flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:16:56 +02:00
Valerio	8d29382b25	feat: extract __NEXT_DATA__ into structured_data Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">. Now extracted as structured JSON (pageProps) in the structured_data field. Tested on 45 sites — 13 return rich structured data including prices, product info, and page state not visible in the DOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:04:51 +02:00
Valerio	84b2e6092e	feat: SvelteKit data extraction + license change to AGPL-3.0 - Extract structured JSON from SvelteKit kit.start() data arrays - Convert JS object literals (unquoted keys) to valid JSON - Data appears in structured_data field (machine-readable) - License changed from MIT to AGPL-3.0 - Bump to v0.3.4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 20:37:56 +02:00
Valerio	32c035c543	feat: v0.1.4 — QuickJS integration for inline JavaScript data extraction Embeds QuickJS (rquickjs) to execute inline <script> tags and extract data hidden in JavaScript variable assignments. Captures window.__* objects like __preloadedData (NYTimes), __PRELOADED_STATE__ (Wired), and self.__next_f (Next.js RSC flight data). Results: - NYTimes: 1,552 → 4,162 words (+168%) - Wired: 1,459 → 9,937 words (+580%) - Zero measurable performance overhead (<15ms per page) - Feature-gated: disable with --no-default-features for WASM Smart text filtering rejects CSS, base64, file paths, code strings. Only readable prose is appended under "## Additional Content". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 10:28:16 +01:00
Valerio	afe4d3077d	feat: v0.1.2 — TLS fallback, Safari default, Reddit fix, YouTube transcript infra - Switch default profile to Safari26/Mac (best CF pass rate) - Auto-fallback to plain client on connection error or 403 - Fixes: ycombinator.com, producthunt.com, and similar CF-strict sites - Reddit .json endpoint uses plain client (TLS fingerprint was blocked) - YouTube caption track extraction + timed text parser (core, not yet wired) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 18:50:07 +01:00
Valerio	ea9c783bc5	fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation Critical: - MCP server identifies as "webclaw-mcp" instead of "rmcp" - Research tool poll loop capped at 200 iterations (~10 min) CLI: - Non-zero exit codes on errors - Text format strips markdown table syntax MCP server: - URL validation on all tools - 60s cloud API timeout, 30s local fetch timeout - Diff cloud fallback computes actual diff - Batch capped at 100 URLs, crawl at 500 pages - Graceful startup failure instead of panic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:25:05 +01:00
Valerio	c99ec684fa	Initial release: webclaw v0.1.0 — web content extraction for LLMs CLI + MCP server for extracting clean, structured content from any URL. 6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats. MIT Licensed \| https://webclaw.io	2026-03-23 18:31:11 +01:00

16 commits