webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-07-21 07:01:01 +02:00

Author	SHA1	Message	Date
Jacob Magar	1c112459bc	fix: error on explicit missing config path; update env.example; add README config docs - config.rs: NoxaConfig::load() now exits with an error when an explicit path (via --config arg or NOXA_CONFIG env var) does not exist; implicit ./config.json missing still silently returns default - Updated test: test_load_missing_file_returns_default replaced with test_load_implicit_missing_file_returns_default (tests None path, not explicit) - env.example: replaced fat legacy content with slim secrets-only template matching .env.example; deleted redundant .env.example - README.md: replaced bare env-var table with full Configuration section covering config.json workflow, precedence, NOXA_CONFIG override, and bool flag limitation	2026-04-11 12:35:21 -04:00
Jacob Magar	10364416c1	chore: slim .env.example to secrets/URLs only	2026-04-11 12:29:05 -04:00
Jacob Magar	9acecba314	chore: add config.example.json and gitignore config.json	2026-04-11 12:29:00 -04:00
Jacob Magar	f22051491f	fix: use resolved config in raw_html selector guard; remove dead path_prefix fallback	2026-04-11 12:28:29 -04:00
Jacob Magar	bac13fc1b5	feat: wire ResolvedConfig into main.rs via clap ValueSource	2026-04-11 12:24:44 -04:00
Jacob Magar	e7583a5c51	docs: clarify doc comments on ResolvedConfig selector and raw_html fields	2026-04-11 12:18:53 -04:00
Jacob Magar	3bc6a9920b	feat: add NoxaConfig and ResolvedConfig with load() Introduces config.rs with NoxaConfig (serde Deserialize, all-optional fields, unknown-field-tolerant), ResolvedConfig (concrete post-merge struct), and NoxaConfig::load() (explicit path > NOXA_CONFIG env > ./config.json, missing file returns default). Also adds Debug derives to OutputFormat, Browser, and PdfModeArg required by NoxaConfig. 4 tests pass.	2026-04-11 12:16:56 -04:00
Jacob Magar	cc1617a3a9	fix(gemini-cli): correct CLI invocation to match gemini v0.36 interface The previous implementation used wrong flags (-p without value, --json, --max-output-tokens) that don't exist in the real gemini CLI. Correct invocation: - Pass prompt as -p STRING value (not via stdin) - Use --output-format json to get structured {response, stats} output - Add --yolo to suppress interactive confirmation prompts - Remove nonexistent --json and --max-output-tokens flags - Parse `.response` field from JSON output, skipping MCP noise lines - Extend timeout from 30s to 60s (agentic CLI is slower than raw API) Smoke tested end-to-end: stdin HTML → summarize and --extract-json both produce correct output via Gemini CLI.	2026-04-11 12:16:21 -04:00
Jacob Magar	cfe455b752	feat: derive Deserialize on OutputFormat, Browser, PdfModeArg	2026-04-11 12:13:25 -04:00
Jacob Magar	af304eda7f	docs(noxa-9fw.4): describe gemini cli as primary llm backend - Update CLAUDE.md: provider chain, LLM modules section, CLI examples - Update env.example: add GEMINI_MODEL, reorder providers (Gemini first) - Update noxa-llm/src/lib.rs crate doc comment	2026-04-11 07:36:19 -04:00
Jacob Magar	993fd6c45d	feat(noxa-9fw.3): validate structured extraction output with one retry - Add jsonschema crate for schema validation in extract_json - On parse failure (invalid JSON): retry once with identical request - On schema mismatch (valid JSON, wrong schema): fail immediately — no retry - validate_schema() produces concise error with field path from instance_path() - Add SequenceMockProvider to testing.rs for first-fail/second-success tests - Fix env var test flakiness: mark env_model_override as ignored	2026-04-11 07:34:58 -04:00
Jacob Magar	420a1d7522	feat(noxa-9fw.2): make gemini cli the primary llm backend - ProviderChain::default() order: Gemini CLI -> OpenAI -> Ollama -> Anthropic - Add --llm-provider gemini arm to build_llm_provider() in noxa-cli - Update unknown-provider error to mention gemini - Update empty-chain error messages in CLI and MCP to mention gemini CLI - Update MCP startup warn! to list gemini CLI as first option	2026-04-11 07:32:24 -04:00
Jacob Magar	d800c37bfd	feat(noxa-9fw.1): add gemini cli provider adapter - Add LlmError::Subprocess(#[from] io::Error) and LlmError::Timeout variants - Implement GeminiCliProvider: new(model) -> Self matching OllamaProvider pattern - Prompts passed exclusively via stdin (Stdio::piped), never as CLI args - 30s subprocess timeout via tokio::time::timeout to prevent hung processes - 6-slot Semaphore to bound concurrent subprocess spawns in MCP context - Stderr captured and included (first 500 bytes) in non-zero exit errors - is_available(): pure `gemini --version` PATH check, no live inference - GEMINI_MODEL env override; default model gemini-2.5-pro - strip_thinking_tags + strip_code_fences applied to stdout output	2026-04-11 07:30:41 -04:00
Jacob Magar	8674b60b4e	chore: rebrand webclaw to noxa	2026-04-11 00:10:38 -04:00
Valerio	a4c351d5ae	feat: add fallback sitemap paths for broader discovery Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms use non-standard paths that were previously missed. Paths found via robots.txt are deduplicated to avoid double-fetching. Bump to 0.3.11. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 18:22:57 +02:00
Valerio	25b6282d5f	style: fix rustfmt for 2-element delay array Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 17:21:53 +02:00
Valerio	954aabe3e8	perf: reduce fetch timeout to 12s and retries to 2 Stress testing showed 33% of proxies are dead, causing 30s+ timeouts per request with 3 retries (worst case 94s). Reducing timeout from 30s to 12s and retries from 3 to 2 brings worst case to 25s. Combined with disabling 509 dead proxies from the pool, this should significantly improve response times under load. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 17:18:57 +02:00
Valerio	5ea646a332	fix: resolve clippy warnings from #14 (collapsible_if, manual_inspect) CI runs Rust 1.94 which flags these. Collapsed nested if-let in cell_has_block_content() and replaced .map()+return with .inspect() in table_to_md(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 15:28:59 +02:00
Valerio	3cf9dbaf2a	chore: bump to 0.3.9, fix formatting from #14 Version bump for layout table, stack overflow, and noise filter fixes contributed by @devnen. Also fixes cargo fmt issues that caused CI lint failure on the merge commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 15:24:17 +02:00
Valerio	87ecf4241f	fix: layout tables, stack overflow, and noise filter (#14) fix: layout tables rendered as sections instead of markdown tables	2026-04-04 15:20:08 +02:00
devnen	70c67f2ed6	fix: prevent noise filter from swallowing content in malformed HTML Two related fixes for content being stripped by the noise filter: 1. Remove <form> from unconditional noise tags. ASP.NET and similar frameworks wrap entire pages in a <form> tag — these are not input forms. Forms with >500 chars of text are now treated as content wrappers, not noise. 2. Add safety valve for class/ID noise matching. When malformed HTML leaves a noise container unclosed (e.g., <div class="header"> missing its </div>), the HTML5 parser makes all subsequent siblings into children of that container. A header/nav/footer with >5000 chars of text is almost certainly a broken wrapper absorbing real content — exempt it from noise filtering.	2026-04-04 01:38:42 +02:00
devnen	74bac87435	fix: prevent stack overflow on deeply nested HTML pages Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing the default 1 MB main-thread stack on Windows during recursive markdown conversion. Two-layer fix: 1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit 2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so html5ever parsing and extraction both have room on deeply nested pages Tested with Express.co.uk live blog (previously crashed, now extracts 2000+ lines of clean markdown) and drudgereport.com (still works correctly).	2026-04-03 23:45:19 +02:00
devnen	95a6681b02	fix: detect layout tables and render as sections instead of markdown tables Sites like Drudge Report use <table> for page layout, not data. Each cell contains extensive block-level content (divs, hrs, paragraphs, links). Previously, table_to_md() called inline_text() on every cell, collapsing all whitespace and flattening block elements into a single unreadable line. Changes: - Add cell_has_block_content() heuristic: scans for block-level descendants (p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables - Layout tables render each cell as a standalone section separated by blank lines, using children_to_md() to preserve block structure - Data tables (no block elements in cells) keep existing markdown table format - Bold/italic tags containing block elements are treated as containers instead of wrapping in //* (fixes Drudge's <b><font>...</font></b> column wrappers that contain the entire column content) - Add tests for layout tables with paragraphs and with links	2026-04-03 22:24:35 +02:00
Valerio	1d2018c98e	fix: MCP research saves to file, returns compact response Research results saved to ~/.webclaw/research/ (report.md + full.json). MCP returns file paths + findings instead of the full report, preventing "exceeds maximum allowed tokens" errors in Claude/Cursor. Same query returns cached result instantly without spending credits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 16:05:45 +02:00
Valerio	f7cc0cc5cf	feat: CLI --research flag + MCP cloud fallback + structured research output - --research "query": deep research via cloud API, saves JSON file with report + sources + findings, prints report to stdout - --deep: longer, more thorough research mode - MCP extract/summarize: cloud fallback when no local LLM available - MCP research: returns structured JSON instead of raw text - Bump to v0.3.7 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 14:04:04 +02:00
Valerio	344eea74d9	feat: structured data in markdown/LLM output + v0.3.6 __NEXT_DATA__, SvelteKit, and JSON-LD now appear as a ## Structured Data section in -f markdown and -f llm output. Works with --only-main-content and all extraction flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:16:56 +02:00
Valerio	b219fc3648	fix(ci): update all 4 Homebrew checksums after Docker build completes Previous approach used mislav/bump-homebrew-formula-action which only updated macOS arm64 SHA. Now downloads all 4 tarballs after Docker finishes, computes SHAs, and writes the complete formula. Fixes #12 (brew install checksum mismatch on Linux) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:02:27 +02:00
Valerio	8d29382b25	feat: extract __NEXT_DATA__ into structured_data Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">. Now extracted as structured JSON (pageProps) in the structured_data field. Tested on 45 sites — 13 return rich structured data including prices, product info, and page state not visible in the DOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:04:51 +02:00
Valerio	4e81c3430d	docs: update npm package license to AGPL-3.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 11:33:43 +02:00
Valerio	c43da982c3	docs: update README license references from MIT to AGPL-3.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 11:28:40 +02:00
Valerio	84b2e6092e	feat: SvelteKit data extraction + license change to AGPL-3.0 - Extract structured JSON from SvelteKit kit.start() data arrays - Convert JS object literals (unquoted keys) to valid JSON - Data appears in structured_data field (machine-readable) - License changed from MIT to AGPL-3.0 - Bump to v0.3.4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 20:37:56 +02:00
Valerio	b4800e681c	ci: fix aarch64 cross-compilation for BoringSSL (boring-sys2) boring-sys2 builds BoringSSL from C source via cmake. For aarch64 cross- compilation, we need g++, cmake, and CC/CXX env vars pointing to the cross-compiler. Also removed stale reqwest_unstable RUSTFLAG. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:39:43 +02:00
Valerio	a1b9a55048	chore: add SKILL.md to repo root for skills.sh discoverability Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:27:17 +02:00
Valerio	124352e0b4	style: cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:25:40 +02:00
Valerio	1a5d3d8aaf	chore: remove reqwest_unstable rustflag (no longer needed) The --cfg reqwest_unstable flag was required by the old patched reqwest. wreq handles everything internally — no special build flags needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:15:05 +02:00
Valerio	11b8f68f51	fix: update Dockerfile for BoringSSL build deps (cmake, clang) wreq uses BoringSSL (via boring-sys2) which needs cmake and clang at build time. Removed stale reference to Impit's patched rustls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:13:18 +02:00
Valerio	aaf51eddef	feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles. This removes all 5 [patch.crates-io] entries that consumers previously needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145) are now built directly on wreq's Emulation API with correct TLS options, HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order. 84% pass rate across 1000 real sites. 384 unit tests green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:04:55 +02:00
Valerio	0d0da265ab	chore: bump to v0.3.2, update changelog Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:56:51 +02:00
Valerio	da1d76c97a	feat: add --cookie-file support for JSON cookie files - --cookie-file reads Chrome extension format ([{name, value, domain, ...}]) - Works with EditThisCookie, Cookie-Editor, and similar browser extensions - Merges with --cookie when both provided - MCP scrape tool now accepts cookies parameter - Closes #7 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:54:53 +02:00
Valerio	44f23332cc	style: collapse nested if per clippy Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:13:55 +02:00
Valerio	20c810b8d2	chore: bump v0.3.1, update CHANGELOG, fix fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:11:54 +02:00
Valerio	7041a1d992	feat: cookie warmup fallback for Akamai-protected pages When a fetch returns a challenge page (small HTML with Akamai markers), automatically visit the homepage first to collect _abck/bm_sz cookies, then retry the original URL. This bypasses Akamai's cookie-based gate on subpages without needing JS execution. Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr sensor marker on responses under 15KB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:09:31 +02:00
github-actions[bot]	75e0a9cdef	chore: update webclaw-tls dependencies	2026-03-30 12:03:06 +00:00
github-actions[bot]	b784a3fa1b	chore: update webclaw-tls dependencies	2026-03-30 11:48:44 +00:00
Valerio	4cba36337b	style: fix fmt in client.rs test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:18:57 +02:00
Valerio	199dab6dfa	fix: adapt to webclaw-tls v0.1.1 HeaderMap API change Response.headers() now returns &http::HeaderMap instead of &HashMap<String, String>. Updated FetchResult, is_pdf_content_type, is_document_content_type, is_bot_protected, and all related tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:09:50 +02:00
github-actions[bot]	68b9406ff5	chore: update webclaw-tls dependencies	2026-03-30 09:53:03 +00:00
Valerio	31f35fd895	ci: fix ambiguous reqwest version in dependency sync Core has reqwest 0.12 (direct) and 0.13 (via webclaw-tls patch). Disambiguate with version specs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 11:52:35 +02:00
Valerio	4f0c59ac7f	ci: replace stale primp check with webclaw-tls dependency sync Replaces the weekly primp compatibility check (which fails since primp was removed in v0.3.0) with an automated dependency sync workflow. Triggered by webclaw-tls pushes via repository_dispatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 11:39:55 +02:00
Valerio	ee3c714aa9	docs: update CONTRIBUTING.md for v0.3.0 architecture - Replace Impit/primp references with webclaw-tls - Add architecture diagram showing crate layout + TLS repo - Update crate boundaries table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 10:17:26 +02:00
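
Commits 3bc6a9920b and 1c112459bc above describe the config lookup order (explicit path > NOXA_CONFIG env > ./config.json) and the rule that only an explicitly requested path must exist. The following is a minimal sketch of that rule, not the code in config.rs: `ConfigSource`, `resolve_config_path`, and `check_exists` are hypothetical names chosen for illustration, and the existence check is passed in as a closure so the logic can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical type: records whether the path was explicitly requested.
#[derive(Debug, PartialEq)]
enum ConfigSource {
    /// From the --config argument or the NOXA_CONFIG environment variable.
    Explicit(PathBuf),
    /// The default ./config.json in the working directory.
    Implicit(PathBuf),
}

/// Apply the precedence: CLI argument, then environment value, then default.
fn resolve_config_path(cli_arg: Option<&str>, env_value: Option<&str>) -> ConfigSource {
    if let Some(p) = cli_arg {
        return ConfigSource::Explicit(PathBuf::from(p));
    }
    if let Some(p) = env_value {
        return ConfigSource::Explicit(PathBuf::from(p));
    }
    ConfigSource::Implicit(PathBuf::from("./config.json"))
}

/// A missing explicit path is an error; a missing implicit path means
/// "use defaults", signalled here by Ok(None).
fn check_exists(
    source: &ConfigSource,
    exists: impl Fn(&Path) -> bool,
) -> Result<Option<PathBuf>, String> {
    match source {
        ConfigSource::Explicit(p) if !exists(p.as_path()) => {
            Err(format!("config file not found: {}", p.display()))
        }
        ConfigSource::Explicit(p) => Ok(Some(p.clone())),
        ConfigSource::Implicit(p) if exists(p.as_path()) => Ok(Some(p.clone())),
        ConfigSource::Implicit(_) => Ok(None),
    }
}
```

In real use the closure would typically be `|p| p.exists()`, and the environment value would come from `std::env::var("NOXA_CONFIG").ok()`.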

89 commits
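
Commit 993fd6c45d above states a retry policy for structured extraction: retry once on invalid JSON, fail immediately on a schema mismatch. The sketch below illustrates only that control flow, under assumptions: `ExtractError` and `extract_with_retry` are hypothetical names, and the LLM request is modelled as a closure rather than the real provider call.

```rust
/// Hypothetical error type distinguishing the two failure modes.
#[derive(Debug, PartialEq)]
enum ExtractError {
    /// The model output could not be parsed as JSON at all.
    InvalidJson(String),
    /// The output parsed, but did not satisfy the requested schema.
    SchemaMismatch(String),
}

/// Run `attempt` once; on a parse failure, run the identical request one
/// more time. Any other outcome, including a schema mismatch, is returned
/// as-is without a retry.
fn extract_with_retry<F>(mut attempt: F) -> Result<String, ExtractError>
where
    F: FnMut() -> Result<String, ExtractError>,
{
    match attempt() {
        Err(ExtractError::InvalidJson(_)) => attempt(),
        other => other,
    }
}
```

The asymmetry follows the commit's reasoning: a parse failure may be transient, so a second identical request is worth one try, whereas valid JSON with the wrong shape is treated as a definitive failure.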