Commit graph

91 commits

Author SHA1 Message Date
Valerio
eff914e84f
Merge pull request #31 from 0xMassi/feat/oss-webclaw-server
v0.4.0: self-hosted REST server, bench subcommand, mcp warning fix (#26, #29, #30)
2026-04-22 12:30:23 +02:00
Valerio
c7e5abea8f docs(changelog): v0.4.0 release notes (#26, #29, #30) 2026-04-22 12:25:44 +02:00
Valerio
d71eebdacc fix(mcp): silence dead-code warning on tool_router field (closes #30)
cargo install webclaw-mcp on a fresh machine prints

  warning: field `tool_router` is never read
   --> crates/webclaw-mcp/src/server.rs:22:5

The field is essential — dropping it unregisters every MCP tool. The
warning shows up because rmcp 1.3.x changed how the #[tool_handler]
macro reads the field: instead of referencing it by name in the
generated impl, it goes through a derived trait method. rustc's
dead-code lint sees only the named usage and fires.

The field stays. Annotated with #[allow(dead_code)] and a comment
explaining the situation so the next person looking at this doesn't
remove the field thinking it's actually unused.

No behaviour change. Verified clean compile under rmcp 1.3.0 in our
lock; the warning will disappear for anyone running cargo install
against this commit.
2026-04-22 12:25:39 +02:00
Valerio
d91ad9c1f4 feat(cli): add webclaw bench <url> subcommand (closes #26)
Per-URL extraction micro-benchmark. Fetches a URL once, runs the same
pipeline as --format llm, prints a small ASCII table comparing raw
HTML vs. llm output on tokens, bytes, and extraction time.

  webclaw bench https://stripe.com               # ASCII table
  webclaw bench https://stripe.com --json        # one-line JSON
  webclaw bench https://stripe.com --facts FILE  # adds fidelity row

The --facts file uses the same schema as benchmarks/facts.json (curated
visible-fact list per URL). URLs not in the file produce no fidelity
row, so an uncurated site doesn't show 0/0.

v1 uses an approximate tokenizer (chars/4 for Latin text, chars/2 when
CJK dominates). It is off by ~10% vs cl100k_base, but the signal ('is
the LLM output 90% smaller than the raw HTML') is order-of-magnitude,
not precise accounting. Output is labeled '~ tokens' so nobody mistakes
it for a real BPE count. Swapping in tiktoken-rs later is a
one-function change; it was left out of v1 to avoid the 2 MB BPE-data
binary bloat for a feature most users will run a handful of times.
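
A minimal sketch of the chars/4 vs chars/2 heuristic described above. The function names, the Unicode ranges treated as CJK, and the reading of "dominates" as "more than half of all characters" are assumptions for illustration, not webclaw's actual code.

```rust
/// Hypothetical helper: true for a few common CJK blocks (not exhaustive).
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF       // Hiragana and Katakana
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7AF // Hangul syllables
    )
}

/// Hypothetical estimator: chars/4 for mostly-Latin text, chars/2 when CJK
/// characters make up more than half of the input (assumed threshold).
fn approx_tokens(text: &str) -> usize {
    let total = text.chars().count();
    if total == 0 {
        return 0;
    }
    let cjk = text.chars().filter(|&c| is_cjk(c)).count();
    let divisor = if cjk * 2 > total { 2 } else { 4 };
    // Round up so a short non-empty input never reports zero tokens.
    (total + divisor - 1) / divisor
}
```

Counting characters rather than bytes keeps the estimate independent of UTF-8 encoding width, which matters most for the CJK branch.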

Implemented as a real clap subcommand (clap::Subcommand) rather than
yet another flag, with the existing flag-based flow falling through
when no subcommand is given. Existing 'webclaw <url> --format ...'
invocations work exactly as before. Lays the groundwork for future
subcommands without disrupting the legacy flat-flag UX.

12 new unit tests cover the tokenizer, formatters, host extraction,
and fact-matching. Verified end-to-end on example.com and tavily.com
(5/5 facts preserved at 93% token reduction).
2026-04-22 12:25:29 +02:00
Valerio
2ba682adf3 feat(server): add OSS webclaw-server REST API binary (closes #29)
Self-hosters hitting docs/self-hosting were promised three binaries
but the OSS Docker image only shipped two. webclaw-server lived in
the closed-source hosted-platform repo, which couldn't be opened. This
adds a minimal axum REST API in the OSS repo so self-hosting actually
works without pretending to ship the cloud platform.

Crate at crates/webclaw-server/. Stateless, no database, no job queue,
single binary. Endpoints: GET /health, POST /v1/{scrape, crawl, map,
batch, extract, summarize, diff, brand}. JSON shapes mirror
api.webclaw.io for the endpoints OSS can support, so swapping between
self-hosted and hosted is a base-URL change.

Auth: optional bearer token via WEBCLAW_API_KEY / --api-key. Comparison
is constant-time (subtle::ConstantTimeEq). Open mode (no key) is
allowed and binds 127.0.0.1 by default; the Docker image flips
WEBCLAW_HOST=0.0.0.0 so the container is reachable out of the box.
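
To illustrate the optional bearer check, here is a std-only sketch. The commit itself uses subtle::ConstantTimeEq; the hand-rolled `ct_eq` below is only a stand-in to show the idea and is not a substitute for that crate, since an optimizing compiler gives no timing guarantees. `authorized` and its parameters are hypothetical names.

```rust
/// Byte-wise comparison that always scans the full input, so the running
/// time does not depend on where the first mismatch occurs.
/// A length mismatch still returns early.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// `expected == None` models open mode (no API key configured).
/// Otherwise the header must be exactly "Bearer <key>".
fn authorized(auth_header: Option<&str>, expected: Option<&str>) -> bool {
    match expected {
        None => true,
        Some(key) => auth_header
            .and_then(|h| h.strip_prefix("Bearer "))
            .map(|token| ct_eq(token.as_bytes(), key.as_bytes()))
            .unwrap_or(false),
    }
}
```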

Hard caps to keep naive callers from OOMing the process: crawl capped
at 500 pages synchronously, batch capped at 100 URLs / 20 concurrent.
For unbounded crawls or anti-bot bypass the docs point users at the
hosted API.

Dockerfile + Dockerfile.ci updated to copy webclaw-server into
/usr/local/bin and EXPOSE 3000. Workspace version bumped to 0.4.0
(new public binary).
2026-04-22 12:25:11 +02:00
Valerio
b4bfff120e
fix(docker): entrypoint shim so child images with custom CMD work (#28)
v0.3.13 switched ENTRYPOINT to ["webclaw"] to make `docker run IMAGE
https://example.com` work. That broke a different use case: downstream
Dockerfiles that `FROM ghcr.io/0xmassi/webclaw` and set their own
CMD ["./setup.sh"]. The child's ./setup.sh becomes an argument to webclaw,
which tries to fetch it as a URL and fails:

  fetch error: request failed: error sending request for uri
  (https://./setup.sh): client error (Connect)

Both Dockerfile and Dockerfile.ci now use docker-entrypoint.sh which:
- forwards flags (-*) and URLs (http://, https://) to `webclaw`
- exec's anything else directly

Test matrix (all pass locally):
  docker run IMAGE https://example.com     → webclaw scrape ok
  docker run IMAGE --help                   → webclaw --help ok
  docker run IMAGE                          → default CMD, --help
  docker run IMAGE bash                     → bash runs
  FROM IMAGE + CMD ["./setup.sh"]           → setup.sh runs, webclaw available

Default CMD is ["webclaw", "--help"] so bare `docker run IMAGE` still
prints help.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:57:47 +02:00
Valerio
e27ee1f86f
docs(benchmarks): reproducible 3-way comparison vs trafilatura + firecrawl (#25)
Replaces the previous benchmarks/README.md, which claimed specific numbers
(94.2% accuracy, 0.8ms extraction, 97% Cloudflare bypass, etc.) with no
reproducing code committed to the repo. The `webclaw-bench` crate and
`benchmarks/fixtures`, `benchmarks/ground-truth` directories it referenced
never existed. This is what #18 was calling out.

New benchmarks/ is fully reproducible. Every number ships with the script
that produced it. `./benchmarks/run.sh` regenerates everything.

Results (18 sites, 90 hand-curated facts, median of 3 runs, webclaw 0.3.18,
cl100k_base tokenizer):

  tool          reduction_mean   fidelity        latency_mean
  webclaw              92.5%    76/90 (84.4%)        0.41s
  firecrawl            92.4%    70/90 (77.8%)        0.99s
  trafilatura          97.8%    45/90 (50.0%)        0.21s

webclaw matches or beats both competitors on fidelity on all 18 sites
while running 2.4x faster than Firecrawl's hosted API.

Includes:
- README.md              — headline table + per-site breakdown
- methodology.md         — tokenizer, fact selection, run rationale
- sites.txt              — 18 canonical URLs
- facts.json             — 90 curated facts (PRs welcome to add sites)
- scripts/bench.py       — the runner
- results/2026-04-17.json — today's raw data, median of 3 runs
- run.sh                 — one-command reproduction

Closes #18

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:46:19 +02:00
Valerio
0463b5e263 style: cargo fmt
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 12:03:22 +02:00
Valerio
7f0420bbf0
fix(core): UTF-8 char boundary panic in find_content_position (#16) (#24)
`search_from = abs_pos + 1` landed mid-char when a rejected match
started on a multi-byte UTF-8 character, panicking on the next
`markdown[search_from..]` slice. Advance by `needle.len()` instead —
always a valid char boundary, and skips the whole rejected match
instead of re-scanning inside it.
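
The fix can be sketched as a search loop like the one below. `find_accepted` and its `accept` predicate are hypothetical stand-ins for find_content_position and its rejection logic; only the `needle.len()` advance is taken from the description above.

```rust
/// Returns the byte offset of the first occurrence of `needle` that the
/// `accept` predicate approves, skipping rejected matches.
fn find_accepted(
    haystack: &str,
    needle: &str,
    accept: impl Fn(usize) -> bool,
) -> Option<usize> {
    if needle.is_empty() {
        return None; // avoid looping forever on an empty needle
    }
    let mut search_from = 0;
    while let Some(rel) = haystack[search_from..].find(needle) {
        let abs_pos = search_from + rel;
        if accept(abs_pos) {
            return Some(abs_pos);
        }
        // `abs_pos + 1` can land inside a multi-byte character and make the
        // next slice panic. The end of a matched &str is always a char
        // boundary, so advance past the whole rejected match instead.
        search_from = abs_pos + needle.len();
    }
    None
}
```

With Cyrillic input every character is two bytes wide, so the old `+ 1` advance would have split the first character of any rejected match.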

Repro: webclaw https://bruler.ru/about_brand -f json
Before: panic "byte index 782 is not a char boundary; it is inside 'Ч'"
After:  extracts 2.3KB of clean Cyrillic markdown with 7 sections

Two regression tests cover multi-byte rejected matches and
all-rejected cycles in Cyrillic text.

Closes #16

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 12:02:52 +02:00
Valerio
095ae5d4b1
polish(fetch,mcp): robots parser + firefox client cache + Acquire ordering (P3) (#23)
Three P3 items from the 2026-04-16 audit. Bump to 0.3.17.

webclaw-fetch/sitemap.rs: parse_robots_txt used trimmed[..8] slice
plus eq_ignore_ascii_case for the directive test. That was fragile:
"Sitemap :" (space before colon) fell through silently, inline
"# ..." comments leaked into the URL, and a line with no URL at all
returned an empty string. Rewritten to split on the first colon,
match any-case "sitemap" as the directive name, strip comments, and
require `://` in the value. +7 unit tests cover case variants,
space-before-colon, comments, empty values, non-URL values, and
non-sitemap directives.
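
A rough sketch of the rewritten per-line logic, under the rules listed above. `sitemap_from_line` is a hypothetical name, and stripping everything after the first `#` is a simplification that would also cut URL fragments.

```rust
/// Extracts the URL from a robots.txt `Sitemap:` line, or None if the line
/// is a different directive, has no value, or the value is not a URL.
fn sitemap_from_line(line: &str) -> Option<String> {
    // Strip an inline comment before looking at the directive.
    let line = line.split('#').next().unwrap_or("");
    // Split on the first colon only; the URL itself contains more colons.
    let (name, value) = line.split_once(':')?;
    if !name.trim().eq_ignore_ascii_case("sitemap") {
        return None;
    }
    let value = value.trim();
    if value.contains("://") {
        Some(value.to_string())
    } else {
        None
    }
}
```

Trimming the directive name before the case-insensitive comparison is what lets "Sitemap :" with a space before the colon match.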

webclaw-fetch/crawler.rs: is_cancelled now uses Ordering::Acquire
instead of Relaxed. Behaviourally equivalent on current hardware for
single-word atomic loads, but the explicit ordering documents intent
for readers and compilers.
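
For illustration, a cancellation flag with an Acquire load might look like this. `CancelFlag` is a hypothetical type, and pairing the load with a Release store is an assumption; the commit message only mentions the load side.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical shared cancellation flag.
#[derive(Clone)]
struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    fn new() -> Self {
        CancelFlag(Arc::new(AtomicBool::new(false)))
    }

    /// Release store (assumed): writes made before cancelling are visible
    /// to any thread that later observes `true` through an Acquire load.
    fn cancel(&self) {
        self.0.store(true, Ordering::Release);
    }

    /// Acquire load, as in the change described above.
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::Acquire)
    }
}
```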

webclaw-mcp/server.rs: add lazy OnceLock cache for the Firefox
FetchClient. Tool calls that repeatedly request the firefox profile
without cookies used to build a fresh reqwest pool + TLS stack per
call. Chrome (default) already used the long-lived field; Random is
per-call by design; cookie-bearing requests still build ad-hoc since
the cookie header is part of the client shape.
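
The lazy cache can be sketched with std::sync::OnceLock as below. `FetchClient` here is an empty stand-in and `Server` a hypothetical struct; the real types carry a connection pool and TLS configuration.

```rust
use std::sync::OnceLock;

/// Stand-in for the real client, which is expensive to build.
struct FetchClient {
    profile: &'static str,
}

/// Hypothetical server state holding the lazily built Firefox client.
struct Server {
    firefox: OnceLock<FetchClient>,
}

impl Server {
    fn new() -> Self {
        Server { firefox: OnceLock::new() }
    }

    /// Builds the Firefox client on first use and returns the same
    /// instance on every later call.
    fn firefox_client(&self) -> &FetchClient {
        self.firefox
            .get_or_init(|| FetchClient { profile: "firefox" })
    }
}
```

`get_or_init` runs the closure at most once even when several tool calls race, so no extra locking is needed around the field.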

Tests: 85 webclaw-fetch (was 78, +7 new sitemap), 272 webclaw-core,
43 webclaw-llm, 11 CLI — all green. Clippy clean across workspace.

Refs: docs/AUDIT-2026-04-16.md P3 section

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 20:21:32 +02:00
Valerio
d69c50a31d
feat(fetch,llm): DoS hardening + glob validation + cleanup (P2) (#22)
* feat(fetch,llm): DoS hardening via response caps + glob validation (P2)

Response body caps:
- webclaw-fetch::Response::from_wreq now rejects bodies over 50 MB. Checks
  Content-Length up front (before the allocation) and the actual
  .bytes() length after (belt-and-braces against lying upstreams).
  Previously the HTML -> markdown conversion downstream could allocate
  multiple String copies per page; a 100 MB page would OOM the process.
- webclaw-llm providers (anthropic/openai/ollama) share a new
  response_json_capped helper with a 5 MB cap. Protects against a
  malicious or runaway provider response exhausting memory.
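
The two-stage body cap can be sketched as follows. The function and error names are hypothetical, and the limit is passed in as a parameter so the checks can be exercised without allocating 50 MB.

```rust
/// Hypothetical error type for an over-limit response body.
#[derive(Debug, PartialEq)]
enum BodyCapError {
    DeclaredTooLarge { declared: u64, limit: u64 },
    ActualTooLarge { actual: u64, limit: u64 },
}

/// Stage 1: reject on the Content-Length header, before reading the body.
/// A missing header passes; stage 2 catches it.
fn check_declared(content_length: Option<u64>, limit: u64) -> Result<(), BodyCapError> {
    match content_length {
        Some(declared) if declared > limit => {
            Err(BodyCapError::DeclaredTooLarge { declared, limit })
        }
        _ => Ok(()),
    }
}

/// Stage 2: re-check the bytes actually received, in case the upstream
/// understated or omitted Content-Length.
fn check_actual(body: &[u8], limit: u64) -> Result<(), BodyCapError> {
    let actual = body.len() as u64;
    if actual > limit {
        Err(BodyCapError::ActualTooLarge { actual, limit })
    } else {
        Ok(())
    }
}
```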

Crawler frontier cap: after each BFS depth level the frontier is
truncated to max(max_pages * 10, 100) entries, keeping the most
recently discovered links. Dense pages (tag clouds, search results)
used to push the frontier into the tens of thousands even after
max_pages halted new fetches.
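
A sketch of the frontier truncation, assuming the frontier is a Vec with the most recently discovered links at the end. `cap_frontier` is a hypothetical name and the real crawler's data structure may differ.

```rust
/// Truncates the frontier to max(max_pages * 10, 100) entries, dropping
/// the oldest links (front of the Vec) and keeping the newest.
fn cap_frontier(frontier: &mut Vec<String>, max_pages: usize) {
    let cap = max_pages.saturating_mul(10).max(100);
    if frontier.len() > cap {
        let excess = frontier.len() - cap;
        frontier.drain(..excess);
    }
}
```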

Glob pattern validation: user-supplied include_patterns /
exclude_patterns are rejected at Crawler::new if they contain more
than 4 `**` wildcards or exceed 1024 chars. The backtracking matcher
degrades exponentially on deeply-nested `**` against long paths.
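
The two limits could be checked along these lines. The tests mentioned below refer to validate_glob, but this signature and the error strings are assumptions; the sketch also measures length in bytes and counts non-overlapping `**` occurrences.

```rust
const MAX_GLOB_LEN: usize = 1024;
const MAX_DOUBLE_STARS: usize = 4;

/// Rejects patterns that are too long or contain too many `**` wildcards.
fn validate_glob(pattern: &str) -> Result<(), String> {
    if pattern.len() > MAX_GLOB_LEN {
        return Err(format!(
            "glob pattern is {} bytes, limit is {}",
            pattern.len(),
            MAX_GLOB_LEN
        ));
    }
    let double_stars = pattern.matches("**").count();
    if double_stars > MAX_DOUBLE_STARS {
        return Err(format!(
            "glob pattern has {} `**` wildcards, limit is {}",
            double_stars, MAX_DOUBLE_STARS
        ));
    }
    Ok(())
}
```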

Cleanup:
- Removed blanket #![allow(dead_code)] from webclaw-cli/src/main.rs;
  no warnings surfaced, the suppression was obsolete.
- core/.gitignore: replaced overbroad *.json with specific local-
  artifact patterns (previous rule would have swallowed package.json,
  components.json, .smithery/*.json).

Tests: +4 validate_glob tests. Full workspace test: 283 passed
(webclaw-core + webclaw-fetch + webclaw-llm).

Version: 0.3.15 -> 0.3.16
CHANGELOG updated.

Refs: docs/AUDIT-2026-04-16.md (P2 section)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: gitignore CLI research dumps, drop accidentally-tracked file

research-*.json output from `webclaw ... --research ...` got silently
swept into git by the relaxed *.json gitignore in the preceding commit.
The old blanket *.json rule was hiding both this legitimate scratch
file AND packages/create-webclaw/server.json (MCP registry config that
we DO want tracked).

Removes the research dump from git and adds a narrower research-*.json
ignore pattern so future CLI output doesn't get re-tracked by accident.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 19:44:08 +02:00
Valerio
7773c8af2a
fix(fetch): surface semaphore-closed as typed error instead of panic (P1) (#21)
Three call sites in webclaw-fetch used .expect("semaphore closed") on
`Semaphore::acquire()`. Under normal operation they never fire, but
under a shutdown race or adversarial runtime state the spawned task
would panic and be silently dropped from the batch / crawl run — the
caller would see fewer results than URLs with no indication why.

Rewritten to match on the acquire result:
- client::fetch_batch and client::fetch_and_extract_batch_with_options
  now emit BatchResult/BatchExtractResult carrying
  FetchError::Build("semaphore closed before acquire").
- crawler's inner loop emits a failed PageResult with the same error
  string instead of panicking.

Behaviorally a no-op for the happy path. Fixes the silent-dropped-task
class of bug noted in the 2026-04-16 audit.

Version: 0.3.14 -> 0.3.15
CHANGELOG updated.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 19:20:26 +02:00
Valerio
1352f48e05
fix(cli): close --on-change command injection via sh -c (P0) (#20)
* fix(cli): close --on-change command injection via sh -c (P0)

The --on-change flag on `webclaw watch` (single-URL, line 1588) and
`webclaw watch` multi-URL mode (line 1738) previously handed the entire
user-supplied string to `tokio::process::Command::new("sh").arg("-c").arg(cmd)`.
Any path that can influence that string — a malicious config file, an MCP
client driven by an LLM with prompt-injection exposure, an untrusted
environment variable substitution — gets arbitrary shell execution.

The command is now tokenized with `shlex::split` (POSIX-ish quoting rules)
and executed directly via `Command::new(prog).args(args)`. Metacharacters
like `;`, `&&`, `|`, `$()`, `<(...)`, env expansion, and globbing no longer
fire.

An explicit opt-in escape hatch is available for users who genuinely need
a shell pipeline: `WEBCLAW_ALLOW_SHELL=1` preserves the old `sh -c` path
and logs a warning on every invocation so it can't slip in silently.
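
To show why direct execution defuses metacharacters, here is a std-only sketch that builds the command from already-split tokens. Tokenization itself is assumed to have happened (the commit uses shlex::split for that), and `build_direct` is a hypothetical name rather than the actual spawn_on_change() helper.

```rust
use std::process::Command;

/// Builds a Command that runs the first token as the program and passes
/// the rest as literal arguments. No shell is involved, so `;`, `&&`,
/// `|` and `$()` reach the program as plain strings.
fn build_direct(tokens: &[String]) -> Option<Command> {
    let (prog, args) = tokens.split_first()?;
    let mut cmd = Command::new(prog);
    cmd.args(args);
    Some(cmd)
}
```

An empty token list yields None, which a caller could report as an invalid --on-change value instead of spawning anything.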

Both call sites now route through a shared `spawn_on_change()` helper.

Adds `shlex = "1"` to webclaw-cli dependencies.

Version: 0.3.13 -> 0.3.14
CHANGELOG updated.

Surfaced by the 2026-04-16 workspace audit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore(brand): fix clippy 1.95 unnecessary_sort_by errors

Pre-existing sort_by calls in brand.rs became hard errors under clippy
1.95. Switch to sort_by_key with std::cmp::Reverse. Pure refactor — same
ordering, no behavior change. Bundled here so CI goes green on the P0
command-injection fix.
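
The refactor pattern looks roughly like this. The `(name, count)` tuples are made-up data for illustration; brand.rs sorts its own types.

```rust
use std::cmp::Reverse;

/// Sorts entries by count, highest first.
/// Previously this would have been written as
/// `items.sort_by(|a, b| b.1.cmp(&a.1))`, which produces the same order.
fn sort_by_count_desc(items: &mut [(String, u32)]) {
    items.sort_by_key(|(_, count)| Reverse(*count));
}
```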

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 18:37:02 +02:00
Valerio
6316b1a6e7 fix: handle raw newlines in JSON-LD strings
Sites like Bluesky emit JSON-LD with literal newline characters inside
string values (technically invalid JSON). Add sanitize_json_newlines()
fallback that escapes control characters inside quoted strings before
retrying the parse. This recovers ProfilePage, Product, and other
structured data that was previously silently dropped.
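
A sketch of what such a fallback could look like. The function name comes from the message above, but this body is an assumption: a small state machine that tracks whether it is inside a quoted string and escapes raw control characters only there.

```rust
/// Escapes raw control characters that appear inside JSON string values,
/// leaving whitespace between tokens untouched.
fn sanitize_json_newlines(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_string = false;
    let mut escaped = false; // previous char was a backslash inside a string
    for c in input.chars() {
        if !in_string {
            if c == '"' {
                in_string = true;
            }
            out.push(c);
            continue;
        }
        if escaped {
            escaped = false;
            out.push(c);
            continue;
        }
        match c {
            '\\' => {
                escaped = true;
                out.push(c);
            }
            '"' => {
                in_string = false;
                out.push(c);
            }
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            _ => out.push(c),
        }
    }
    out
}
```

Running it only after a first parse attempt fails, as described above, keeps valid JSON-LD on the fast path.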

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 11:40:25 +02:00
Valerio
78e198a347 fix: use ENTRYPOINT instead of CMD in Dockerfiles for proper arg passthrough
Docker CMD gets overridden by any args, while ENTRYPOINT receives them.
This fixes `docker run webclaw <url>` silently ignoring the URL argument.

Bump to 0.3.13.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:24:26 +02:00
Valerio
050b2ef463 feat: add allow_subdomains and allow_external_links to CrawlConfig
Crawls are same-origin by default. Enable allow_subdomains to follow
sibling/child subdomains (blog.example.com from example.com), or
allow_external_links for full cross-origin crawling.

Root domain extraction uses a heuristic that handles two-part TLDs
(co.uk, com.au). Includes 5 unit tests for root_domain().
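
A heuristic of this kind might be sketched as below. The function name root_domain() is from the message above; the body, the two-entry suffix list, and the `same_site` helper are assumptions, and this is not a replacement for the Public Suffix List.

```rust
/// Two-part suffixes handled by this sketch (deliberately incomplete).
const TWO_PART_TLDS: &[&str] = &["co.uk", "com.au"];

/// Returns the registrable domain: the last two labels, or the last three
/// when the host ends in a known two-part TLD.
fn root_domain(host: &str) -> String {
    let labels: Vec<&str> = host.split('.').collect();
    let n = labels.len();
    if n <= 2 {
        return host.to_string();
    }
    let last_two = labels[n - 2..].join(".");
    let keep = if TWO_PART_TLDS.contains(&last_two.as_str()) { 3 } else { 2 };
    labels[n - keep..].join(".")
}

/// Hypothetical check for the allow_subdomains case: two hosts are
/// siblings when they share a root domain.
fn same_site(a: &str, b: &str) -> bool {
    root_domain(a).eq_ignore_ascii_case(&root_domain(b))
}
```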

Bump to 0.3.12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:33:06 +02:00
Valerio
a4c351d5ae feat: add fallback sitemap paths for broader discovery
Try /sitemap_index.xml, /wp-sitemap.xml, and /sitemap/sitemap-index.xml
after the standard /sitemap.xml. WordPress 5.5+ and many CMS platforms
use non-standard paths that were previously missed. Paths found via
robots.txt are deduplicated to avoid double-fetching.
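
The candidate list with deduplication can be sketched as follows. `candidate_sitemaps` is a hypothetical name, and trying robots.txt entries before the fixed paths is an assumed ordering.

```rust
use std::collections::HashSet;

/// Standard path first, then the fallbacks named in this commit.
const SITEMAP_PATHS: [&str; 4] = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
    "/sitemap/sitemap-index.xml",
];

/// Returns sitemap URLs to try, in order, with duplicates removed so a
/// path listed in robots.txt is not fetched a second time.
fn candidate_sitemaps(base: &str, from_robots: &[String]) -> Vec<String> {
    let base = base.trim_end_matches('/');
    let fixed = SITEMAP_PATHS.iter().map(|p| format!("{base}{p}"));
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for url in from_robots.iter().cloned().chain(fixed) {
        if seen.insert(url.clone()) {
            out.push(url);
        }
    }
    out
}
```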

Bump to 0.3.11.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 18:22:57 +02:00
Valerio
25b6282d5f style: fix rustfmt for 2-element delay array
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 17:21:53 +02:00
Valerio
954aabe3e8 perf: reduce fetch timeout to 12s and retries to 2
Stress testing showed 33% of proxies are dead, causing 30s+ timeouts
per request with 3 retries (worst case 94s). Reducing timeout from 30s
to 12s and retries from 3 to 2 brings worst case to 25s. Combined with
disabling 509 dead proxies from the pool, this should significantly
improve response times under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 17:18:57 +02:00
Valerio
5ea646a332 fix: resolve clippy warnings from #14 (collapsible_if, manual_inspect)
CI runs Rust 1.94 which flags these. Collapsed nested if-let in
cell_has_block_content() and replaced .map()+return with .inspect()
in table_to_md().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:28:59 +02:00
Valerio
3cf9dbaf2a chore: bump to 0.3.9, fix formatting from #14
Version bump for layout table, stack overflow, and noise filter fixes
contributed by @devnen. Also fixes cargo fmt issues that caused CI lint
failure on the merge commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 15:24:17 +02:00
Valerio
87ecf4241f
fix: layout tables, stack overflow, and noise filter (#14)
fix: layout tables rendered as sections instead of markdown tables
2026-04-04 15:20:08 +02:00
devnen
70c67f2ed6 fix: prevent noise filter from swallowing content in malformed HTML
Two related fixes for content being stripped by the noise filter:

1. Remove <form> from unconditional noise tags. ASP.NET and similar
   frameworks wrap entire pages in a <form> tag — these are not input
   forms. Forms with >500 chars of text are now treated as content
   wrappers, not noise.

2. Add safety valve for class/ID noise matching. When malformed HTML
   leaves a noise container unclosed (e.g., <div class="header"> missing
   its </div>), the HTML5 parser makes all subsequent siblings into
   children of that container. A header/nav/footer with >5000 chars of
   text is almost certainly a broken wrapper absorbing real content —
   exempt it from noise filtering.
2026-04-04 01:38:42 +02:00
devnen
74bac87435 fix: prevent stack overflow on deeply nested HTML pages
Pages like Express.co.uk live blogs nest 200+ DOM levels deep, overflowing
the default 1 MB main-thread stack on Windows during recursive markdown
conversion.

Two-layer fix:

1. markdown.rs: add depth parameter to node_to_md/children_to_md/inline_text
   with MAX_DOM_DEPTH=200 guard — falls back to plain text collection at limit

2. lib.rs: wrap extract_with_options in a worker thread with 8 MB stack so
   html5ever parsing and extraction both have room on deeply nested pages
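
Both layers can be sketched together on a toy tree. MAX_DOM_DEPTH=200 and the 8 MB stack are from the message above; the `Node` type, `collect_text`, and `extract_on_worker` are hypothetical stand-ins for the real DOM and extraction functions.

```rust
use std::thread;

const MAX_DOM_DEPTH: usize = 200;
const WORKER_STACK_BYTES: usize = 8 * 1024 * 1024;

/// Toy DOM node standing in for the real parsed tree.
struct Node {
    text: String,
    children: Vec<Node>,
}

/// Layer 1: recursive conversion with a depth guard.
fn node_to_md(node: &Node, depth: usize, out: &mut String) {
    if depth >= MAX_DOM_DEPTH {
        // At the limit, stop recursing and fall back to plain text.
        collect_text(node, out);
        return;
    }
    out.push_str(&node.text);
    for child in &node.children {
        node_to_md(child, depth + 1, out);
    }
}

/// Iterative plain-text collection, so it cannot overflow the stack.
fn collect_text(node: &Node, out: &mut String) {
    let mut stack = vec![node];
    while let Some(n) = stack.pop() {
        out.push_str(&n.text);
        stack.extend(n.children.iter().rev());
    }
}

/// Layer 2: run the whole extraction on a worker thread with a larger stack.
fn extract_on_worker(root: Node) -> String {
    thread::Builder::new()
        .stack_size(WORKER_STACK_BYTES)
        .spawn(move || {
            let mut out = String::new();
            node_to_md(&root, 0, &mut out);
            out
        })
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}
```

The depth guard bounds recursion in the converter itself, while the larger stack covers code that the guard does not control, such as the parser.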

Tested with Express.co.uk live blog (previously crashed, now extracts 2000+
lines of clean markdown) and drudgereport.com (still works correctly).
2026-04-03 23:45:19 +02:00
devnen
95a6681b02 fix: detect layout tables and render as sections instead of markdown tables
Sites like Drudge Report use <table> for page layout, not data. Each cell
contains extensive block-level content (divs, hrs, paragraphs, links).

Previously, table_to_md() called inline_text() on every cell, collapsing
all whitespace and flattening block elements into a single unreadable line.

Changes:
- Add cell_has_block_content() heuristic: scans for block-level descendants
  (p, div, hr, ul, ol, h1-h6, etc.) to distinguish layout vs data tables
- Layout tables render each cell as a standalone section separated by blank
  lines, using children_to_md() to preserve block structure
- Data tables (no block elements in cells) keep existing markdown table format
- Bold/italic tags containing block elements are treated as containers
  instead of wrapping in **/**/* (fixes Drudge's <b><font>...</font></b>
  column wrappers that contain the entire column content)
- Add tests for layout tables with paragraphs and with links
2026-04-03 22:24:35 +02:00
Valerio
1d2018c98e fix: MCP research saves to file, returns compact response
Research results saved to ~/.webclaw/research/ (report.md + full.json).
MCP returns file paths + findings instead of the full report, preventing
"exceeds maximum allowed tokens" errors in Claude/Cursor.

Same query returns cached result instantly without spending credits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:05:45 +02:00
Valerio
f7cc0cc5cf feat: CLI --research flag + MCP cloud fallback + structured research output
- --research "query": deep research via cloud API, saves JSON file with
  report + sources + findings, prints report to stdout
- --deep: longer, more thorough research mode
- MCP extract/summarize: cloud fallback when no local LLM available
- MCP research: returns structured JSON instead of raw text
- Bump to v0.3.7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:04:04 +02:00
Valerio
344eea74d9 feat: structured data in markdown/LLM output + v0.3.6
__NEXT_DATA__, SvelteKit, and JSON-LD now appear as a
## Structured Data section in -f markdown and -f llm output.
Works with --only-main-content and all extraction flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:16:56 +02:00
Valerio
b219fc3648 fix(ci): update all 4 Homebrew checksums after Docker build completes
Previous approach used mislav/bump-homebrew-formula-action which only
updated macOS arm64 SHA. Now downloads all 4 tarballs after Docker
finishes, computes SHAs, and writes the complete formula.

Fixes #12 (brew install checksum mismatch on Linux)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 19:02:27 +02:00
Valerio
8d29382b25 feat: extract __NEXT_DATA__ into structured_data
Next.js pages embed server-rendered data in <script id="__NEXT_DATA__">.
Now extracted as structured JSON (pageProps) in the structured_data field.

Tested on 45 sites — 13 return rich structured data including prices,
product info, and page state not visible in the DOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:04:51 +02:00
Valerio
4e81c3430d docs: update npm package license to AGPL-3.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:33:43 +02:00
Valerio
c43da982c3 docs: update README license references from MIT to AGPL-3.0
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 11:28:40 +02:00
Valerio
84b2e6092e feat: SvelteKit data extraction + license change to AGPL-3.0
- Extract structured JSON from SvelteKit kit.start() data arrays
- Convert JS object literals (unquoted keys) to valid JSON
- Data appears in structured_data field (machine-readable)
- License changed from MIT to AGPL-3.0
- Bump to v0.3.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:37:56 +02:00
Valerio
b4800e681c ci: fix aarch64 cross-compilation for BoringSSL (boring-sys2)
boring-sys2 builds BoringSSL from C source via cmake. For aarch64 cross-
compilation, we need g++, cmake, and CC/CXX env vars pointing to the
cross-compiler. Also removed stale reqwest_unstable RUSTFLAG.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:39:43 +02:00
Valerio
a1b9a55048 chore: add SKILL.md to repo root for skills.sh discoverability
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:27:17 +02:00
Valerio
124352e0b4 style: cargo fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:25:40 +02:00
Valerio
1a5d3d8aaf chore: remove reqwest_unstable rustflag (no longer needed)
The --cfg reqwest_unstable flag was required by the old patched reqwest.
wreq handles everything internally — no special build flags needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:15:05 +02:00
Valerio
11b8f68f51 fix: update Dockerfile for BoringSSL build deps (cmake, clang)
wreq uses BoringSSL (via boring-sys2) which needs cmake and clang
at build time. Removed stale reference to Impit's patched rustls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:13:18 +02:00
Valerio
aaf51eddef feat: replace custom TLS stack with wreq (BoringSSL), bump v0.3.3
Migrated webclaw-fetch from webclaw-tls (patched rustls/h2/hyper/reqwest)
to wreq by @0x676e67. wreq uses BoringSSL for TLS and the http2 crate
for HTTP/2 fingerprinting — battle-tested with 60+ browser profiles.

This removes all 5 [patch.crates-io] entries that consumers previously
needed. Browser profiles (Chrome 145, Firefox 135, Safari 18, Edge 145)
are now built directly on wreq's Emulation API with correct TLS options,
HTTP/2 SETTINGS ordering, pseudo-header order, and header wire order.

84% pass rate across 1000 real sites. 384 unit tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:04:55 +02:00
Valerio
0d0da265ab chore: bump to v0.3.2, update changelog
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:56:51 +02:00
Valerio
da1d76c97a feat: add --cookie-file support for JSON cookie files
- --cookie-file reads Chrome extension format ([{name, value, domain, ...}])
- Works with EditThisCookie, Cookie-Editor, and similar browser extensions
- Merges with --cookie when both provided
- MCP scrape tool now accepts cookies parameter
- Closes #7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:54:53 +02:00
Valerio
44f23332cc style: collapse nested if per clippy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:13:55 +02:00
Valerio
20c810b8d2 chore: bump v0.3.1, update CHANGELOG, fix fmt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:11:54 +02:00
Valerio
7041a1d992 feat: cookie warmup fallback for Akamai-protected pages
When a fetch returns a challenge page (small HTML with Akamai markers),
automatically visit the homepage first to collect _abck/bm_sz cookies,
then retry the original URL. This bypasses Akamai's cookie-based gate
on subpages without needing JS execution.

Detected via: <title>Challenge Page</title> or bazadebezolkohpepadr
sensor marker on responses under 15KB.
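
The detection step could be sketched like this. The two markers and the 15KB threshold are from the message above; the function name and the reading of 15KB as 15 * 1024 bytes are assumptions.

```rust
/// Size threshold below which a response is considered a possible
/// challenge page (assumed to mean 15 * 1024 bytes).
const MAX_CHALLENGE_BYTES: usize = 15 * 1024;

/// True when a small response carries one of the Akamai challenge markers,
/// signalling that a homepage warmup and retry is worth trying.
fn looks_like_akamai_challenge(html: &str) -> bool {
    html.len() < MAX_CHALLENGE_BYTES
        && (html.contains("<title>Challenge Page</title>")
            || html.contains("bazadebezolkohpepadr"))
}
```

The size gate keeps a full article that merely mentions one of the markers from triggering a pointless warmup request.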

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:09:31 +02:00
github-actions[bot]
75e0a9cdef chore: update webclaw-tls dependencies 2026-03-30 12:03:06 +00:00
github-actions[bot]
b784a3fa1b chore: update webclaw-tls dependencies 2026-03-30 11:48:44 +00:00
Valerio
4cba36337b style: fix fmt in client.rs test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:18:57 +02:00
Valerio
199dab6dfa fix: adapt to webclaw-tls v0.1.1 HeaderMap API change
Response.headers() now returns &http::HeaderMap instead of
&HashMap<String, String>. Updated FetchResult, is_pdf_content_type,
is_document_content_type, is_bot_protected, and all related tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:09:50 +02:00
github-actions[bot]
68b9406ff5 chore: update webclaw-tls dependencies 2026-03-30 09:53:03 +00:00
Valerio
31f35fd895 ci: fix ambiguous reqwest version in dependency sync
Core has reqwest 0.12 (direct) and 0.13 (via webclaw-tls patch).
Disambiguate with version specs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:52:35 +02:00