webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-08 22:25:12 +02:00

Author	SHA1	Message	Date
Valerio	3cf9dbaf2a	chore: bump to 0.3.9, fix formatting from #14 Version bump for layout table, stack overflow, and noise filter fixes contributed by @devnen. Also fixes cargo fmt issues that caused CI lint failure on the merge commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 15:24:17 +02:00
Valerio	87ecf4241f	fix: layout tables, stack overflow, and noise filter (#14) fix: layout tables rendered as sections instead of markdown tables	2026-04-04 15:20:08 +02:00
devnen	70c67f2ed6	fix: prevent noise filter from swallowing content in malformed HTML Two related fixes for content being stripped by the noise filter: 1. Remove <form> from unconditional noise tags. ASP.NET and similar frameworks wrap entire pages in a <form> tag — these are not input forms. Forms with >500 chars of text are now treated as content wrappers, not noise. 2. Add safety valve for class/ID noise matching. When malformed HTML leaves a noise container unclosed (e.g., <div class="header"> missing its </div>), the HTML5 parser makes all subsequent siblings into children of that container. A header/nav/footer with >5000 chars of text is almost certainly a broken wrapper absorbing real content — exempt it from noise filtering.	2026-04-04 01:38:42 +02:00
devnen	74bac87435	fix: prevent stack overflow on deeply nested HTML pages Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing the default 1 MB main-thread stack on Windows during recursive markdown conversion. Two-layer fix: 1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit 2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so html5ever parsing and extraction both have room on deeply nested pages Tested with Express.co.uk live blog (previously crashed, now extracts 2000+ lines of clean markdown) and drudgereport.com (still works correctly).	2026-04-03 23:45:19 +02:00
devnen	95a6681b02	fix: detect layout tables and render as sections instead of markdown tables Sites like Drudge Report use <table> for page layout, not data. Each cell contains extensive block-level content (divs, hrs, paragraphs, links). Previously, table_to_md() called inline_text() on every cell, collapsing all whitespace and flattening block elements into a single unreadable line. Changes: - Add cell_has_block_content() heuristic: scans for block-level descendants (p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables - Layout tables render each cell as a standalone section separated by blank lines, using children_to_md() to preserve block structure - Data tables (no block elements in cells) keep existing markdown table format - Bold/italic tags containing block elements are treated as containers instead of wrapping in **/* (fixes Drudge's <b><font>...</font></b> column wrappers that contain the entire column content) - Add tests for layout tables with paragraphs and with links	2026-04-03 22:24:35 +02:00
Valerio	1d2018c98e	fix: MCP research saves to file, returns compact response Research results saved to ~/.webclaw/research/ (report.md + full.json). MCP returns file paths + findings instead of the full report, preventing "exceeds maximum allowed tokens" errors in Claude/Cursor. Same query returns cached result instantly without spending credits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 16:05:45 +02:00
Valerio	f7cc0cc5cf	feat: CLI --research flag + MCP cloud fallback + structured research output - --research "query": deep research via cloud API, saves JSON file with report + sources + findings, prints report to stdout - --deep: longer, more thorough research mode - MCP extract/summarize: cloud fallback when no local LLM available - MCP research: returns structured JSON instead of raw text - Bump to v0.3.7 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 14:04:04 +02:00
Valerio	344eea74d9	feat: structured data in markdown/LLM output + v0.3.6 __NEXT_DATA__, SvelteKit, and JSON-LD now appear as a ## Structured Data section in -f markdown and -f llm output. Works with --only-main-content and all extraction flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:16:56 +02:00
Valerio	b219fc3648	fix(ci): update all 4 Homebrew checksums after Docker build completes Previous approach used mislav/bump-homebrew-formula-action which only updated macOS arm64 SHA. Now downloads all 4 tarballs after Docker finishes, computes SHAs, and writes the complete formula. Fixes #12 (brew install checksum mismatch on Linux) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 19:02:27 +02:00
Valerio	8d29382b25	feat: extract __NEXT_DATA__ into structured_data Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">. Now extracted as structured JSON (pageProps) in the structured_data field. Tested on 45 sites — 13 return rich structured data including prices, product info, and page state not visible in the DOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:04:51 +02:00
Valerio	4e81c3430d	docs: update npm package license to AGPL-3.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 11:33:43 +02:00
Valerio	c43da982c3	docs: update README license references from MIT to AGPL-3.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 11:28:40 +02:00
Valerio	84b2e6092e	feat: SvelteKit data extraction + license change to AGPL-3.0 - Extract structured JSON from SvelteKit kit.start() data arrays - Convert JS object literals (unquoted keys) to valid JSON - Data appears in structured_data field (machine-readable) - License changed from MIT to AGPL-3.0 - Bump to v0.3.4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 20:37:56 +02:00
Valerio	b4800e681c	ci: fix aarch64 cross-compilation for BoringSSL (boring-sys2) boring-sys2 builds BoringSSL from C source via cmake. For aarch64 cross-compilation, we need g++, cmake, and CC/CXX env vars pointing to the cross-compiler. Also removed stale reqwest_unstable RUSTFLAG. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:39:43 +02:00
Valerio	a1b9a55048	chore: add SKILL.md to repo root for skills.sh discoverability Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:27:17 +02:00
Valerio	124352e0b4	style: cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:25:40 +02:00
Valerio	1a5d3d8aaf	chore: remove reqwest_unstable rustflag (no longer needed) The --cfg reqwest_unstable flag was required by the old patched reqwest. wreq handles everything internally — no special build flags needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:15:05 +02:00
Valerio	11b8f68f51	fix: update Dockerfile for BoringSSL build deps (cmake, clang) wreq uses BoringSSL (via boring-sys2) which needs cmake and clang at build time. Removed stale reference to Impit's patched rustls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:13:18 +02:00
Valerio	aaf51eddef	feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3 Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest) to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles. This removes all 5 [patch.crates-io] entries that consumers previously needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145) are now built directly on wreq's Emulation API with correct TLS options, HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order. 84% pass rate across 1000 real sites. 384 unit tests green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 18:04:55 +02:00
Valerio	0d0da265ab	chore: bump to v0.3.2, update changelog Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:56:51 +02:00
Valerio	da1d76c97a	feat: add --cookie-file support for JSON cookie files - --cookie-file reads Chrome extension format ([{name, value, domain, ...}]) - Works with EditThisCookie, Cookie-Editor, and similar browser extensions - Merges with --cookie when both provided - MCP scrape tool now accepts cookies parameter - Closes #7 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:54:53 +02:00
Valerio	44f23332cc	style: collapse nested if per clippy Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:13:55 +02:00
Valerio	20c810b8d2	chore: bump v0.3.1, update CHANGELOG, fix fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:11:54 +02:00
Valerio	7041a1d992	feat: cookie warmup fallback for Akamai-protected pages When a fetch returns a challenge page (small HTML with Akamai markers), automatically visit the homepage first to collect _abck/bm_sz cookies, then retry the original URL. This bypasses Akamai's cookie-based gate on subpages without needing JS execution. Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr sensor marker on responses under 15KB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:09:31 +02:00
github-actions[bot]	75e0a9cdef	chore: update webclaw-tls dependencies	2026-03-30 12:03:06 +00:00
github-actions[bot]	b784a3fa1b	chore: update webclaw-tls dependencies	2026-03-30 11:48:44 +00:00
Valerio	4cba36337b	style: fix fmt in client.rs test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:18:57 +02:00
Valerio	199dab6dfa	fix: adapt to webclaw-tls v0.1.1 HeaderMap API change Response.headers() now returns &http::HeaderMap instead of &HashMap<String, String>. Updated FetchResult, is_pdf_content_type, is_document_content_type, is_bot_protected, and all related tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:09:50 +02:00
github-actions[bot]	68b9406ff5	chore: update webclaw-tls dependencies	2026-03-30 09:53:03 +00:00
Valerio	31f35fd895	ci: fix ambiguous reqwest version in dependency sync Core has reqwest 0.12 (direct) and 0.13 (via webclaw-tls patch). Disambiguate with version specs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 11:52:35 +02:00
Valerio	4f0c59ac7f	ci: replace stale primp check with webclaw-tls dependency sync Replaces the weekly primp compatibility check (which fails since primp was removed in v0.3.0) with an automated dependency sync workflow. Triggered by webclaw-tls pushes via repository_dispatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 11:39:55 +02:00
Valerio	ee3c714aa9	docs: update CONTRIBUTING.md for v0.3.0 architecture - Replace Impit/primp references with webclaw-tls - Add architecture diagram showing crate layout + TLS repo - Update crate boundaries table Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 10:17:26 +02:00
Valerio	e3b0d0bd74	fix: make reddit and linkedin modules public for server access Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:54:35 +02:00
Valerio	7051d2193b	docs: add v0.3.0 changelog entry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:47:04 +02:00
Valerio	f275a93bec	fix: clippy empty-line-after-doc-comment in browser.rs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:45:05 +02:00
Valerio	140234c139	style: cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:43:11 +02:00
Valerio	f13cb83c73	feat: replace primp with webclaw-tls, bump to v0.3.0 Replace primp dependency with our own TLS fingerprinting stack (webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match. - Remove primp entirely (zero references remaining) - webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls - Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains) - Skip unknown certificate extensions (SCT tolerance) - 99% bypass rate on 102 sites (was ~85% with primp) - Fixes #5 (HTTPS broken — example.com and similar sites now work) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 16:40:10 +02:00
Valerio	77e93441c0	fix(ci): add QEMU for arm64 apt-get in Docker build Plain docker build --platform linux/arm64 on amd64 runner needs QEMU to execute RUN commands. QEMU is only needed for apt-get (seconds), not for Rust compilation (the binaries are pre-built). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:51:51 +01:00
Valerio	8cf021a00b	fix(ci): single Docker job with plain docker build + manifest buildx creates manifest lists per-platform which can't be nested. Use plain docker build for each arch then docker manifest create to combine them. Single job, no matrix, no QEMU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:45:05 +01:00
Valerio	78810793cf	chore: align Cargo.toml version with v0.2.3 tag	2026-03-27 20:41:02 +01:00
Valerio	ef120f6ec7	fix(ci): fix Docker binary path extraction from release tarball Tarball extracts to webclaw-vX.Y.Z-target/ directory, not flat. Use direct cp instead of find. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:39:14 +01:00
Valerio	48a3c45b36	ci: use pre-built binaries for Docker instead of QEMU cross-compilation QEMU arm64 Rust builds took 60+ min and timed out in CI. Now the Docker job downloads the pre-built release binaries and packages them directly. - Dockerfile.ci: slim image for CI (downloads pre-built binaries) - Dockerfile: full source build for local dev (unchanged build stage) - Both use ubuntu:24.04 (GLIBC 2.39 matches CI build environment) - Multi-arch manifest combines amd64 + arm64 images Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 20:32:50 +01:00
Valerio	dfcddd1973	ci: build multi-platform Docker images (amd64 + arm64) Image only had linux/amd64, failing on Apple Silicon Macs with "no matching manifest for linux/arm64/v8". Added QEMU + buildx multi-platform support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 19:18:49 +01:00
Valerio	341f4737e1	test: v0.2.2 pre-release check	2026-03-27 18:48:15 +01:00
Valerio	8b82ad12d0	ci: add weekly primp compatibility check Runs every Monday — updates primp to latest, tries to build. If patches are out of sync the build fails with a clear error pointing to primp's Cargo.toml for the new patch list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 18:45:45 +01:00
Valerio	76cb6b6cd7	fix: add reqwest to patch list, sync with primp 1.2.0 primp 1.2.0 moved to reqwest 0.13 and now patches reqwest itself (primp-reqwest). Without this patch, cargo install gets vanilla reqwest 0.13 which is missing the HTTP/2 impersonation methods. Users should use: cargo install --locked --git ... Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 18:45:28 +01:00
Valerio	2f6255fe6f	add SKILL.md for Claude Code skill integration Adds the webclaw skill definition for Claude Code / Smithery. Located at skill/SKILL.md with proper frontmatter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 17:58:01 +01:00
Valerio	a6be233df9	feat: v0.2.1 — Docker image on GHCR, QuickJS data island extraction - Docker image auto-built on every release via CI - QuickJS sandbox executes inline <script> tags to extract JS-embedded content (window.__PRELOADED_STATE__, self.__next_f, etc.) - Bumped version to 0.2.1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 17:18:31 +01:00
Valerio	b039b99858	ci: add Docker image build to release workflow Pushes ghcr.io/0xmassi/webclaw:latest and :vX.Y.Z on every tagged release. Uses BuildKit cache for fast rebuilds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 16:48:44 +01:00
Valerio	81e78963d0	feat: enable quickjs for JS data island extraction webclaw-core has QuickJS behind a feature flag for extracting data from inline <script> tags (window.__PRELOADED_STATE__, self.__next_f, etc). The server was using an old lockfile without the feature enabled. Updated deps to v0.2.0 and explicitly enabled quickjs. This improves extraction on SPAs like NYTimes, Nike, and Bloomberg where content is embedded in JS variable assignments rather than visible DOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 18:50:32 +01:00
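The two-layer stack-overflow fix described in commit 74bac87435 (a depth guard in the recursive markdown conversion, plus a worker thread with an 8 MB stack) can be illustrated with a minimal sketch. This is not webclaw's code: the `Node` type, `collect_text`, `convert_on_big_stack`, and `nested` are hypothetical stand-ins, and only the names `node_to_md` and `MAX_DOM_DEPTH` and the values 200 and 8 MB come from the commit message.

```rust
use std::thread;

/// Depth limit named in commit 74bac87435.
const MAX_DOM_DEPTH: usize = 200;

/// Hypothetical stand-in for a parsed DOM node (not webclaw's real type).
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Fallback used at the depth limit: gather plain text without recursion.
fn collect_text(node: &Node, out: &mut String) {
    let mut stack = vec![node];
    while let Some(n) = stack.pop() {
        match n {
            Node::Text(t) => out.push_str(t),
            Node::Element { children, .. } => {
                // Push in reverse so children are visited in document order.
                stack.extend(children.iter().rev());
            }
        }
    }
}

/// Recursive conversion with a depth guard (layer 1 of the fix).
fn node_to_md(node: &Node, depth: usize, out: &mut String) {
    if depth >= MAX_DOM_DEPTH {
        collect_text(node, out);
        return;
    }
    match node {
        Node::Text(t) => out.push_str(t),
        Node::Element { tag, children } => {
            // Only bold is handled here, to keep the sketch short.
            let marker = if tag == "b" || tag == "strong" { "**" } else { "" };
            out.push_str(marker);
            for child in children {
                node_to_md(child, depth + 1, out);
            }
            out.push_str(marker);
        }
    }
}

/// Run the conversion on a worker thread with an 8 MB stack (layer 2).
fn convert_on_big_stack(root: Node) -> String {
    thread::Builder::new()
        .stack_size(8 * 1024 * 1024)
        .spawn(move || {
            let mut out = String::new();
            node_to_md(&root, 0, &mut out);
            out
        })
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}

/// Test helper: wrap a text leaf in `levels` nested <div> elements.
fn nested(levels: usize, text: &str) -> Node {
    let mut node = Node::Text(text.to_string());
    for _ in 0..levels {
        node = Node::Element { tag: "div".to_string(), children: vec![node] };
    }
    node
}
```

Calling `convert_on_big_stack(nested(1000, "hello"))` recurses through the first 200 levels and then switches to the iterative text fallback, so recursion depth stays bounded however deep the input nests.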


71 commits
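The noise-filter safety valve from commit 70c67f2ed6 comes down to two text-length thresholds, which the sketch below illustrates. The `ElementInfo` struct and `is_noise` function are hypothetical and do not reflect webclaw's actual API. Only the 500 and 5000 character limits come from the commit message. Treating shorter forms as noise, and matching class/ID values by substring, are assumptions made for this example.

```rust
/// Thresholds quoted in commit 70c67f2ed6.
const FORM_CONTENT_MIN_CHARS: usize = 500;
const NOISE_WRAPPER_MAX_CHARS: usize = 5000;

/// Hypothetical summary of an element as seen by the filter.
struct ElementInfo<'a> {
    tag: &'a str,
    class_or_id: &'a str,
    text_chars: usize,
}

/// Decide whether an element should be dropped as noise.
fn is_noise(el: &ElementInfo) -> bool {
    // A <form> holding more than 500 chars of text is treated as a page
    // wrapper (e.g. ASP.NET). Assumption: shorter forms remain noise.
    if el.tag == "form" {
        return el.text_chars <= FORM_CONTENT_MIN_CHARS;
    }

    // Assumption: substring match on the class/ID value. A real filter
    // would likely match whole tokens to avoid hits like "canvas".
    let name = el.class_or_id.to_ascii_lowercase();
    let looks_like_noise = ["header", "nav", "footer"]
        .iter()
        .any(|k| name.contains(*k));

    // Safety valve: a header/nav/footer with more than 5000 chars of text
    // is probably an unclosed container absorbing real content, so keep it.
    looks_like_noise && el.text_chars <= NOISE_WRAPPER_MAX_CHARS
}
```

With these rules a `<div class="header">` holding 300 characters is filtered, while the same container holding 9,000 characters (the unclosed-tag case from the commit message) is kept as content.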