Commit graph

16 commits

Author SHA1 Message Date
Valerio
d91ad9c1f4 feat(cli): add webclaw bench <url> subcommand (closes #26)
Per-URL extraction micro-benchmark. Fetches a URL once, runs the same
pipeline as --format llm, prints a small ASCII table comparing raw
HTML vs. llm output on tokens, bytes, and extraction time.

  webclaw bench https://stripe.com               # ASCII table
  webclaw bench https://stripe.com --json        # one-line JSON
  webclaw bench https://stripe.com --facts FILE  # adds fidelity row

The --facts file uses the same schema as benchmarks/facts.json (curated
visible-fact list per URL). URLs not in the file produce no fidelity
row, so an uncurated site doesn't show 0/0.

v1 uses an approximate tokenizer (chars/4 Latin, chars/2 when CJK
dominates). Off by ~10% vs cl100k_base but the signal — 'is the LLM
output 90% smaller than the raw HTML' — is order-of-magnitude, not
precise accounting. Output is labeled '~ tokens' so nobody mistakes
it for a real BPE count. Swapping in tiktoken-rs later is a one
function change; left out of v1 to avoid the 2 MB BPE-data binary
bloat for a feature most users will run a handful of times.

Implemented as a real clap subcommand (clap::Subcommand) rather than
yet another flag, with the existing flag-based flow falling through
when no subcommand is given. Existing 'webclaw <url> --format ...'
invocations work exactly as before. Lays the groundwork for future
subcommands without disrupting the legacy flat-flag UX.

12 new unit tests cover the tokenizer, formatters, host extraction,
and fact-matching. Verified end-to-end on example.com and tavily.com
(5/5 facts preserved at 93% token reduction).
2026-04-22 12:25:29 +02:00
Valerio
d69c50a31d
feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)

Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
  Content-Length up front (before the allocation) and the actual
  .bytes() length after (belt-and-braces against lying upstreams).
  Previously the HTML -> markdown conversion downstream could allocate
  multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
  response_json_capped helper with a 5 MB cap. Protects against a
  malicious or runaway provider response exhausting memory.

Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.

Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.

Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
  no warnings surfaced, the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
  artifact patterns (previous rule would have swallowed package.json,
  components.json, .smithery/*.json).

Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).

Version: 0.3.15 -> 0.3.16
CHANGELOG updated.

Refs: docs/AUDIT-2026-04-16.md (P2 section)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: gitignore CLI research dumps, drop accidentally-tracked file

research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).

Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 19:44:08 +02:00
Valerio
1352f48e05
fix(cli): close --on-change command injection via sh -c (P0) (#20)
* fix(cli): close --on-change command injection via sh -c (P0)

The --on-change flag on `webclaw watch` (single-URL, line 1588) and
`webclaw watch` multi-URL mode (line 1738) previously handed the entire
user-supplied string to `tokio::process::Command::new("sh").arg("-c").arg(cmd)`.
Any path that can influence that string — a malicious config file, an MCP
client driven by an LLM with prompt-injection exposure, an untrusted
environment variable substitution — gets arbitrary shell execution.

The command is now tokenized with `shlex::split` (POSIX-ish quoting rules)
and executed directly via `Command::new(prog).args(args)`. Metacharacters
like `;`, `&&`, `|`, `$()`, `<(...)`, env expansion, and globbing no longer
fire.

An explicit opt-in escape hatch is available for users who genuinely need
a shell pipeline: `WEBCLAW_ALLOW_SHELL=1` preserves the old `sh -c` path
and logs a warning on every invocation so it can't slip in silently.

Both call sites now route through a shared `spawn_on_change()` helper.

Adds `shlex = "1"` to webclaw-cli dependencies.

Version: 0.3.13 -> 0.3.14
CHANGELOG updated.

Surfaced by the 2026-04-16 workspace audit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore(brand): fix clippy 1.95 unnecessary_sort_by errors

Pre-existing sort_by calls in brand.rs became hard errors under clippy
1.95. Switch to sort_by_key with std::cmp::Reverse. Pure refactor — same
ordering, no behavior change. Bundled here so CI goes green on the P0
command-injection fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 18:37:02 +02:00
Valerio
050b2ef463 feat: add allow_subdomains and allow_external_links to CrawlConfig
Crawls are same-origin by default. Enable allow_subdomains to follow
sibling/child subdomains (blog.example.com from example.com), or
allow_external_links for full cross-origin crawling.

Root domain extraction uses a heuristic that handles two-part TLDs
(co.uk, com.au). Includes 5 unit tests for root_domain().

Bump to 0.3.12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:33:06 +02:00
Valerio
f7cc0cc5cf feat: CLI --research flag + MCP cloud fallback + structured research output
- --research "query": deep research via cloud API, saves JSON file with
  report + sources + findings, prints report to stdout
- --deep: longer, more thorough research mode
- MCP extract/summarize: cloud fallback when no local LLM available
- MCP research: returns structured JSON instead of raw text
- Bump to v0.3.7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:04:04 +02:00
Valerio
344eea74d9 feat: structured data in markdown/LLM output + v0.3.6
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:16:56 +02:00
Valerio
da1d76c97a feat: add --cookie-file support for JSON cookie files
- --cookie-file reads Chrome extension format ([{name, value, domain, ...}])
- Works with EditThisCookie, Cookie-Editor, and similar browser extensions
- Merges with --cookie when both provided
- MCP scrape tool now accepts cookies parameter
- Closes #7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:54:53 +02:00
Valerio
f13cb83c73 feat: replace primp with webclaw-tls, bump to v0.3.0
Replace primp dependency with our own TLS fingerprinting stack
(webclaw-tls). Perfect Chrome 146 JA4 + Akamai hash match.

- Remove primp entirely (zero references remaining)
- webclaw-fetch now uses webclaw-http from github.com/0xMassi/webclaw-tls
- Native + Mozilla root CAs (fixes HTTPS on cross-signed cert chains)
- Skip unknown certificate extensions (SCT tolerance)
- 99% bypass rate on 102 sites (was ~85% with primp)
- Fixes #5 (HTTPS broken — example.com and similar sites now work)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 16:40:10 +02:00
Valerio
ea14848772 feat: v0.2.0 — DOCX/XLSX/CSV extraction, HTML format, multi-URL watch, batch LLM
Document extraction:
- DOCX: auto-detected, outputs markdown with headings (via zip + quick-xml)
- XLSX/XLS: markdown tables with multi-sheet support (via calamine)
- CSV: quoted field handling, markdown table output
- All auto-detected by Content-Type header or URL extension

New features:
- -f html output format (sanitized HTML)
- Multi-URL watch: --urls-file + --watch monitors all URLs in parallel
- Batch + LLM: --extract-prompt/--extract-json works with multiple URLs
- Mixed batch: HTML pages + DOCX + XLSX + CSV in one command

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:28:23 +01:00
Valerio
0e4128782a fix: v0.1.7 — extraction options now work in batch mode (#3)
--only-main-content, --include, and --exclude were ignored in batch
mode because run_batch used default ExtractionOptions. Added
fetch_and_extract_batch_with_options to pass CLI options through.

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 13:30:20 +01:00
Valerio
1b8dfb77a6 feat: v0.1.6 — watch mode, webhooks (Discord/Slack auto-format)
Watch mode:
- --watch polls a URL at --watch-interval (default 5min)
- Reports diffs to stdout when content changes
- --on-change runs a command with diff JSON on stdin
- Ctrl+C stops cleanly

Webhooks:
- --webhook POSTs JSON on crawl/batch complete and watch changes
- Auto-detects Discord and Slack URLs, formats as embeds/blocks
- Also available via WEBCLAW_WEBHOOK_URL env var
- Non-blocking, errors logged to stderr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 12:30:08 +01:00
Valerio
e5649e1824 feat: v0.1.5 — --output-dir saves each page to a separate file
Adds --output-dir flag for CLI. Each extracted page gets its own file
with filename derived from the URL path. Works with single URL, crawl,
and batch modes. CSV input supports custom filenames (url,filename).

Root URLs use hostname/index.ext to avoid collisions in batch mode.
Subdirectories created automatically from URL path structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 11:02:25 +01:00
Valerio
0c91c6d5a9 feat: v0.1.3 — crawl streaming, resume/cancel, MCP proxy support
Crawl:
- Real-time progress on stderr as pages complete
- --crawl-state saves progress on Ctrl+C, resumes from saved state
- Visited set + remaining frontier persisted for accurate resume

MCP server:
- Reads WEBCLAW_PROXY and WEBCLAW_PROXY_FILE env vars
- Falls back to proxies.txt in CWD (existing behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 21:38:28 +01:00
Valerio
b92c0ed186 style: fix cargo fmt formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:27:15 +01:00
Valerio
ea9c783bc5 fix: v0.1.1 — MCP identity, timeouts, exit codes, URL validation
Critical:
- MCP server identifies as "webclaw-mcp" instead of "rmcp"
- Research tool poll loop capped at 200 iterations (~10 min)

CLI:
- Non-zero exit codes on errors
- Text format strips markdown table syntax

MCP server:
- URL validation on all tools
- 60s cloud API timeout, 30s local fetch timeout
- Diff cloud fallback computes actual diff
- Batch capped at 100 URLs, crawl at 500 pages
- Graceful startup failure instead of panic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:25:05 +01:00
Valerio
c99ec684fa Initial release: webclaw v0.1.0 — web content extraction for LLMs
CLI + MCP server for extracting clean, structured content from any URL.
6 Rust crates, 10 MCP tools, TLS fingerprinting, 5 output formats.

MIT Licensed | https://webclaw.io
2026-03-23 18:31:11 +01:00