Commit graph

117 commits

Author SHA1 Message Date
devnen
95a6681b02 fix: detect layout tables and render as sections instead of markdown tables
Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).

Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.

Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
  (p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
  lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
  instead of being wrapped in ** or * (fixes Drudge's <b><font>...</font></b>
  column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
2026-04-03 22:24:35 +02:00
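
The heuristic described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Node` type, the helper constructors `el` and `text`, and `is_layout_table` are hypothetical stand-ins, the tag list covers only the tags the message names, and the rule "any cell with block content makes it a layout table" is an assumption.

```rust
// Hypothetical stand-in for a parsed HTML node; not webclaw's real type.
#[derive(Debug)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

// Block-level tags named in the commit message (its "etc." is left open).
const BLOCK_TAGS: &[&str] = &[
    "p", "div", "hr", "ul", "ol", "h1", "h2", "h3", "h4", "h5", "h6",
];

// True if `node` is, or contains, a block-level element.
fn is_or_contains_block(node: &Node) -> bool {
    match node {
        Node::Text(_) => false,
        Node::Element { tag, children } => {
            BLOCK_TAGS.contains(&tag.as_str()) || children.iter().any(is_or_contains_block)
        }
    }
}

// Scans a table cell's descendants (not the cell itself) for block elements.
fn cell_has_block_content(cell: &Node) -> bool {
    match cell {
        Node::Text(_) => false,
        Node::Element { children, .. } => children.iter().any(is_or_contains_block),
    }
}

// Assumption: one block-bearing cell is enough to treat the table as layout.
fn is_layout_table(cells: &[Node]) -> bool {
    cells.iter().any(cell_has_block_content)
}

// Small constructors to keep examples readable.
fn el(tag: &str, children: Vec<Node>) -> Node {
    Node::Element { tag: tag.to_string(), children }
}

fn text(s: &str) -> Node {
    Node::Text(s.to_string())
}
```

A cell such as `<td><b><font><div>...</div></font></b></td>` would count as layout content here because the `div` is found through the inline wrappers, which matches the Drudge case the message mentions.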
Valerio
1d2018c98e fix: MCP research saves to file, returns compact response
Research results saved to ~/.webclaw/research/ (report.md + full.json).
MCP returns file paths + findings instead of the full report, preventing
"exceeds maximum allowed tokens" errors in Claude/Cursor.

Same query returns cached result instantly without spending credits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:05:45 +02:00
Valerio
f7cc0cc5cf feat: CLI --research flag + MCP cloud fallback + structured research output
- --research "query": deep research via cloud API, saves JSON file with
  report + sources + findings, prints report to stdout
- --deep: longer, more thorough research mode
- MCP extract/summarize: cloud fallback when no local LLM available
- MCP research: returns structured JSON instead of raw text
- Bump to v0.3.7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:04:04 +02:00
Valerio
344eea74d9 feat: structured data in markdown/LLM output + v0.3.6
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:16:56 +02:00
Valerio
b219fc3648 fix(ci): update all 4 Homebrew checksums after Docker build completes
Previous approach used mislav/bump-homebrew-formula-action which only
updated macOS arm64 SHA. Now downloads all 4 tarballs after Docker
finishes, computes SHAs, and writes the complete formula.

Fixes #12 (brew install checksum mismatch on Linux)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:02:27 +02:00
Valerio
8d29382b25 feat: extract __NEXT_DATA__ into structured_data
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.

Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:04:51 +02:00
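
As a rough illustration of the extraction step, the sketch below pulls the raw JSON text out of the `__NEXT_DATA__` script tag with plain string searches. The function name `extract_next_data` is hypothetical, and the real implementation presumably works on a parsed DOM and then selects `pageProps` with a JSON parser; neither of those is shown here.

```rust
// Returns the raw JSON text inside <script id="__NEXT_DATA__" ...>, if present.
// Sketch only: assumes the id attribute is written with double quotes.
fn extract_next_data(html: &str) -> Option<&str> {
    let marker = "id=\"__NEXT_DATA__\"";
    let attr_pos = html.find(marker)?;

    // Make sure the attribute belongs to a <script> opening tag.
    let tag_start = html[..attr_pos].rfind('<')?;
    if !html[tag_start..].starts_with("<script") {
        return None;
    }

    // The script body starts after the '>' that closes the opening tag
    // and runs up to the next closing </script>.
    let body_start = attr_pos + html[attr_pos..].find('>')? + 1;
    let body_len = html[body_start..].find("</script>")?;
    Some(html[body_start..body_start + body_len].trim())
}
```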
Valerio
4e81c3430d docs: update npm package license to AGPL-3.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:33:43 +02:00
Valerio
c43da982c3 docs: update README license references from MIT to AGPL-3.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:28:40 +02:00
Valerio
84b2e6092e feat: SvelteKit data extraction + license change to AGPL-3.0
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:37:56 +02:00
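
The "unquoted keys to valid JSON" step might look something like the sketch below. It is an illustration under stated assumptions, not the project's converter: the name `quote_unquoted_keys` is hypothetical, only double-quoted strings are recognised, and other JS-only syntax (single-quoted strings, trailing commas, `undefined`) is left untouched.

```rust
// Wraps bare identifiers that are followed by ':' in double quotes.
// Text inside double-quoted strings is copied through unchanged.
fn quote_unquoted_keys(js: &str) -> String {
    let chars: Vec<char> = js.chars().collect();
    let mut out = String::with_capacity(js.len() + 16);
    let mut in_string = false;
    let mut i = 0;

    while i < chars.len() {
        let c = chars[i];

        if in_string {
            out.push(c);
            if c == '\\' && i + 1 < chars.len() {
                // Copy the escaped character so an escaped quote does not end the string.
                out.push(chars[i + 1]);
                i += 2;
                continue;
            }
            if c == '"' {
                in_string = false;
            }
            i += 1;
            continue;
        }

        if c == '"' {
            in_string = true;
            out.push(c);
            i += 1;
            continue;
        }

        if c.is_ascii_alphabetic() || c == '_' || c == '$' {
            // Read the whole identifier.
            let start = i;
            while i < chars.len()
                && (chars[i].is_ascii_alphanumeric() || chars[i] == '_' || chars[i] == '$')
            {
                i += 1;
            }
            let ident: String = chars[start..i].iter().collect();

            // Look past whitespace: a following ':' means this was an object key.
            let mut j = i;
            while j < chars.len() && chars[j].is_whitespace() {
                j += 1;
            }
            if j < chars.len() && chars[j] == ':' {
                out.push('"');
                out.push_str(&ident);
                out.push('"');
            } else {
                // Values such as true, false and null pass through as-is.
                out.push_str(&ident);
            }
            continue;
        }

        out.push(c);
        i += 1;
    }
    out
}
```

A character scanner is used here rather than a regex so that colons inside string values are not mistaken for key separators.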
Valerio
b4800e681c ci: fix aarch64 cross-compilation for BoringSSL (boring-sys2)
boring-sys2 builds BoringSSL from C source via cmake. For aarch64 cross-
compilation, we need g++, cmake, and CC/CXX env vars pointing to the
cross-compiler. Also removed stale reqwest_unstable RUSTFLAG.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:39:43 +02:00
Valerio
a1b9a55048 chore: add SKILL.md to repo root for skills.sh discoverability
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:27:17 +02:00
Valerio
124352e0b4 style: cargo fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:25:40 +02:00
Valerio
1a5d3d8aaf chore: remove reqwest_unstable rustflag (no longer needed)
The --cfg reqwest_unstable flag was required by the old patched reqwest.
wreq handles everything internally — no special build flags needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:15:05 +02:00
Valerio
11b8f68f51 fix: update Dockerfile for BoringSSL build deps (cmake, clang)
wreq uses BoringSSL (via boring-sys2) which needs cmake and clang
at build time. Removed stale reference to Impit's patched rustls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:13:18 +02:00
Valerio
aaf51eddef feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.

This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.

84% pass rate across 1000 real sites. 384 unit tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:04:55 +02:00
Valerio
0d0da265ab chore: bump to v0.3.2, update changelog
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:56:51 +02:00
Valerio
da1d76c97a feat: add --cookie-file support for JSON cookie files
- --cookie-file reads Chrome extension format ([{name, value, domain, ...}])
- Works with EditThisCookie, Cookie-Editor, and similar browser extensions
- Merges with --cookie when both provided
- MCP scrape tool now accepts cookies parameter
- Closes #7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:54:53 +02:00
Valerio
44f23332cc style: collapse nested if per clippy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:13:55 +02:00
Valerio
20c810b8d2 chore: bump v0.3.1, update CHANGELOG, fix fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:11:54 +02:00
Valerio
7041a1d992 feat: cookie warmup fallback for Akamai-protected pages
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.

Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:09:31 +02:00
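
The detection rule and retry flow from this commit can be sketched as below. The two markers come from the message; everything else is an assumption for illustration: the names `looks_like_akamai_challenge`, `homepage_of` and `fetch_with_warmup` are hypothetical, "15KB" is read as 15 * 1024 bytes, and the `fetch` closure stands in for an HTTP client that keeps its own cookie jar between calls.

```rust
// Assumption: "15KB" in the commit message is taken as 15 * 1024 bytes.
const MAX_CHALLENGE_BYTES: usize = 15 * 1024;

// A small response carrying either marker is treated as a challenge page.
fn looks_like_akamai_challenge(body: &str) -> bool {
    body.len() < MAX_CHALLENGE_BYTES
        && (body.contains("<title>Challenge Page</title>")
            || body.contains("bazadebezolkohpepadr"))
}

// Derives "scheme://host/" from a URL with plain string handling.
fn homepage_of(url: &str) -> Option<String> {
    let scheme_end = url.find("://")? + 3;
    let host_end = url[scheme_end..]
        .find('/')
        .map(|i| scheme_end + i)
        .unwrap_or(url.len());
    Some(format!("{}/", &url[..host_end]))
}

// Fetch, and on a challenge page visit the homepage first, then retry once.
// `fetch` is assumed to persist cookies (e.g. _abck, bm_sz) across calls.
fn fetch_with_warmup<F: FnMut(&str) -> String>(url: &str, mut fetch: F) -> String {
    let first = fetch(url);
    if !looks_like_akamai_challenge(&first) {
        return first;
    }
    match homepage_of(url) {
        Some(home) => {
            let _ = fetch(&home); // warmup request; only its cookies matter
            fetch(url)
        }
        None => first,
    }
}
```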
github-actions[bot]
75e0a9cdef chore: update webclaw-tls dependencies
2026-03-30 12:03:06 +00:00
github-actions[bot]
b784a3fa1b chore: update webclaw-tls dependencies
2026-03-30 11:48:44 +00:00
Valerio
4cba36337b style: fix fmt in client.rs test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:18:57 +02:00
Valerio
199dab6dfa fix: adapt to webclaw-tls v0.1.1 HeaderMap API change
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:09:50 +02:00
github-actions[bot]
68b9406ff5 chore: update webclaw-tls dependencies
2026-03-30 09:53:03 +00:00
Valerio
31f35fd895 ci: fix ambiguous reqwest version in dependency sync
Core has reqwest 0.12 (direct) and 0.13 (via webclaw-tls patch).
Disambiguate with version specs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:52:35 +02:00
Valerio
4f0c59ac7f ci: replace stale primp check with webclaw-tls dependency sync
Replaces the weekly primp compatibility check (which fails since primp
was removed in v0.3.0) with an automated dependency sync workflow.
Triggered by webclaw-tls pushes via repository_dispatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:39:55 +02:00
Valerio
ee3c714aa9 docs: update CONTRIBUTING.md for v0.3.0 architecture
- Replace Impit/primp references with webclaw-tls
- Add architecture diagram showing crate layout + TLS repo
- Update crate boundaries table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 10:17:26 +02:00
Valerio
e3b0d0bd74 fix: make reddit and linkedin modules public for server access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:54:35 +02:00
Valerio
7051d2193b docs: add v0.3.0 changelog entry
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:47:04 +02:00
Valerio
f275a93bec fix: clippy empty-line-after-doc-comment in browser.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:45:05 +02:00
Valerio
140234c139 style: cargo fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:43:11 +02:00
Valerio
f13cb83c73 feat: replace primp with webclaw-tls, bump to v0.3.0
Replace primp dependency with our own TLS fingerprinting stack
(webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match.

- Remove primp entirely (zero references remaining)
- webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls
- Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains)
- Skip unknown certificate extensions (SCT tolerance)
- 99% bypass rate on 102 sites (was ~85% with primp)
- Fixes #5 (HTTPS broken — example.com and similar sites now work)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:40:10 +02:00
Valerio
77e93441c0 fix(ci): add QEMU for arm64 apt-get in Docker build
Plain docker build --platform linux/arm64 on amd64 runner needs QEMU
to execute RUN commands. QEMU is only needed for apt-get (seconds),
not for Rust compilation (the binaries are pre-built).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:51:51 +01:00
Valerio
8cf021a00b fix(ci): single Docker job with plain docker build + manifest
buildx creates manifest lists per-platform which can't be nested.
Use plain docker build for each arch then docker manifest create
to combine them. Single job, no matrix, no QEMU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:45:05 +01:00
Valerio
78810793cf chore: align Cargo.toml version with v0.2.3 tag
2026-03-27 20:41:02 +01:00
Valerio
ef120f6ec7 fix(ci): fix Docker binary path extraction from release tarball
Tarball extracts to webclaw-vX.Y.Z-target/ directory, not flat.
Use direct cp instead of find.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:39:14 +01:00
Valerio
48a3c45b36 ci: use pre-built binaries for Docker instead of QEMU cross-compilation
QEMU arm64 Rust builds took 60+ min and timed out in CI. Now the Docker
job downloads the pre-built release binaries and packages them directly.

- Dockerfile.ci: slim image for CI (downloads pre-built binaries)
- Dockerfile: full source build for local dev (unchanged build stage)
- Both use ubuntu:24.04 (GLIBC 2.39 matches CI build environment)
- Multi-arch manifest combines amd64 + arm64 images

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 20:32:50 +01:00
Valerio
dfcddd1973 ci: build multi-platform Docker images (amd64 + arm64)
Image only had linux/amd64, failing on Apple Silicon Macs with
"no matching manifest for linux/arm64/v8". Added QEMU + buildx
multi-platform support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 19:18:49 +01:00
Valerio
341f4737e1 test: v0.2.2 pre-release check
2026-03-27 18:48:15 +01:00
Valerio
8b82ad12d0 ci: add weekly primp compatibility check
Runs every Monday — updates primp to latest, tries to build.
If patches are out of sync the build fails with a clear error
pointing to primp's Cargo.toml for the new patch list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 18:45:45 +01:00
Valerio
76cb6b6cd7 fix: add reqwest to patch list, sync with primp 1.2.0
primp 1.2.0 moved to reqwest 0.13 and now patches reqwest itself
(primp-reqwest). Without this patch, cargo install gets vanilla
reqwest 0.13 which is missing the HTTP/2 impersonation methods.

Users should use: cargo install --locked --git ...

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 18:45:28 +01:00
Valerio
2f6255fe6f add SKILL.md for Claude Code skill integration
Adds the webclaw skill definition for Claude Code / Smithery.
Located at skill/SKILL.md with proper frontmatter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 17:58:01 +01:00
Valerio
a6be233df9 feat: v0.2.1 — Docker image on GHCR, QuickJS data island extraction
- Docker image auto-built on every release via CI
- QuickJS sandbox executes inline <script> tags to extract JS-embedded
  content (window.__PRELOADED_STATE__, self.__next_f, etc.)
- Bumped version to 0.2.1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 17:18:31 +01:00
Valerio
b039b99858 ci: add Docker image build to release workflow
Pushes ghcr.io/0xmassi/webclaw:latest and :vX.Y.Z on every tagged
release. Uses BuildKit cache for fast rebuilds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 16:48:44 +01:00
Valerio
81e78963d0 feat: enable quickjs for JS data island extraction
webclaw-core has QuickJS behind a feature flag for extracting data from
inline <script> tags (window.__PRELOADED_STATE__, self.__next_f, etc).
The server was using an old lockfile without the feature enabled.

Updated deps to v0.2.0 and explicitly enabled quickjs. This improves
extraction on SPAs like NYTimes, Nike, and Bloomberg where content is
embedded in JS variable assignments rather than visible DOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 18:50:32 +01:00
Valerio
d67257e931 style: upgrade README badges to for-the-badge style, add X/Twitter
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:30:24 +01:00
Valerio
ea14848772 feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension

New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:28:23 +01:00
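
Auto-detection "by Content-Type header or URL extension" could be sketched along these lines. The MIME types are the standard registered ones for these formats; the `DocKind` enum, the `detect_doc_kind` name, and the choice to check the header before the extension are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
enum DocKind {
    Docx,
    Xlsx,
    Xls,
    Csv,
    Other, // anything else, e.g. an ordinary HTML page
}

// Assumption: the Content-Type header is consulted first, the URL extension second.
fn detect_doc_kind(content_type: Option<&str>, url: &str) -> DocKind {
    if let Some(ct) = content_type {
        // Drop parameters such as "; charset=utf-8".
        let mime = ct.split(';').next().unwrap_or("").trim().to_ascii_lowercase();
        match mime.as_str() {
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document" => {
                return DocKind::Docx
            }
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" => {
                return DocKind::Xlsx
            }
            "application/vnd.ms-excel" => return DocKind::Xls,
            "text/csv" => return DocKind::Csv,
            _ => {}
        }
    }

    // Fall back to the URL path extension, ignoring query string and fragment.
    let path = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("")
        .to_ascii_lowercase();
    if path.ends_with(".docx") {
        DocKind::Docx
    } else if path.ends_with(".xlsx") {
        DocKind::Xlsx
    } else if path.ends_with(".xls") {
        DocKind::Xls
    } else if path.ends_with(".csv") {
        DocKind::Csv
    } else {
        DocKind::Other
    }
}
```

Falling through to the extension when the header is generic (for example `application/octet-stream`) lets mislabelled downloads still be recognised.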
Valerio
0e4128782a fix: v0.1.7 — extraction options now work in batch mode (#3)
--only-main-content, --include, and --exclude were ignored in batch
mode because run_batch used default ExtractionOptions. Added
fetch_and_extract_batch_with_options to pass CLI options through.

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 13:30:20 +01:00
Valerio
1b8dfb77a6 feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format)
Watch mode:
- --watch polls a URL at --watch-interval (default 5min)
- Reports diffs to stdout when content changes
- --on-change runs a command with diff JSON on stdin
- Ctrl+C stops cleanly

Webhooks:
- --webhook POSTs JSON on crawl/batch complete and watch changes
- Auto-detects Discord and Slack URLs, formats as embeds/blocks
- Also available via WEBCLAW_WEBHOOK_URL env var
- Non-blocking, errors logged to stderr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 12:30:08 +01:00
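
The webhook auto-detection could be sketched as below. The URL prefixes are the usual Discord and Slack incoming-webhook hosts, and the payload shapes follow those services' `embeds` and `blocks` formats; the names `WebhookKind`, `detect_webhook_kind`, `json_escape` and `build_payload`, as well as the generic `{"message": ...}` fallback, are hypothetical and not taken from the project.

```rust
#[derive(Debug, PartialEq)]
enum WebhookKind {
    Discord,
    Slack,
    Generic,
}

// Classifies a webhook URL by its host and path prefix.
fn detect_webhook_kind(url: &str) -> WebhookKind {
    let rest = url.strip_prefix("https://").unwrap_or(url);
    if rest.starts_with("discord.com/api/webhooks/")
        || rest.starts_with("discordapp.com/api/webhooks/")
    {
        WebhookKind::Discord
    } else if rest.starts_with("hooks.slack.com/") {
        WebhookKind::Slack
    } else {
        WebhookKind::Generic
    }
}

// Minimal JSON string escaping: quotes, backslashes and control characters.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

// Builds a JSON body in the shape each service expects.
// The Generic shape is a placeholder assumption.
fn build_payload(kind: &WebhookKind, message: &str) -> String {
    let msg = json_escape(message);
    match kind {
        WebhookKind::Discord => format!("{{\"embeds\":[{{\"description\":\"{}\"}}]}}", msg),
        WebhookKind::Slack => format!(
            "{{\"blocks\":[{{\"type\":\"section\",\"text\":{{\"type\":\"mrkdwn\",\"text\":\"{}\"}}}}]}}",
            msg
        ),
        WebhookKind::Generic => format!("{{\"message\":\"{}\"}}", msg),
    }
}
```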