Commit graph

169 commits

Author SHA1 Message Date
devnen
d5a3aa4bf9 feat(core): word-count breakdown in header — article vs chrome split
Current Word count: N is a single number conflating article body
and surrounding chrome (nav, ads, footer). Callers couldn't tell
from the header alone whether to drill or move on.

New: Word count: N (article: M, chrome: K) in -f llm/text output.
For -f json: adds word_count_article and word_count_chrome
fields alongside the existing word_count.

M (article body) is sourced from JSON-LD articleBody when M4's
parser found one (NewsArticle or Review.reviewBody); otherwise
computed by llm::body_word_count (the M2-style heuristic — words
outside markdown link patterns, the same body::process_body output
hub_detect uses).

--mode summary / toc / sections fall back to the simple Word count: N
form (the modes don't extract body content; the breakdown would be
meaningless). Suppression piggybacks on the existing include_status
toggle in build_metadata_header_with_opts.

9 new tests in webclaw-core (4 in lib.rs::tests for the population
logic; 5 in llm/metadata.rs::m12_tests for the header formatter).
Workspace 701 -> 710.
2026-05-23 23:56:14 +02:00
devnen
ade2a5143c feat(core): --mode sections for nav-URL discovery
Section-URL ambiguity is recurring friction — callers have to guess
whether to hit infobae.com root (LATAM frontpage) or /economia/ (AR-
specific live FX dashboard), or decrypt.co root (ticker ribbon) vs
/news/ (article list), or bbc.com/news/world vs /news/world/europe/.
Each guess costs a round-trip.

New `--mode sections` returns the discoverable section URLs parsed
from the page's nav, in one round-trip. Subsumes issue #16 (non-
English nav harder to LLM-parse — sections come back as data, not
prose).

Multi-signal heuristic on the existing link extraction:
URL-pattern match (/<category>/ style short paths), repetition
(section links appear in header + footer), DOM-position when
available. Fallback when zero sections detected: emit top-N links
with a "(none detected; first N shown)" note.

Format: -f llm/text emits `Sections:` followed by `- [Label](url)`
list. -f json emits `{"sections": [{"label": "...", "url": "..."}]}`.

13 new tests in webclaw-core (688 -> 701).
2026-05-23 23:14:40 +02:00
devnen
76cd515a3e feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites
On sites like Hollywood Reporter where the extracted body is < 500 words
because the page is JS-walled (chrome rendering is needed), webclaw now
emits a one-line stderr hint:

  # hint: extracted body is N words (thin); the page may be JS-walled.
    Try --browser chrome for JS-rendered content.

Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs)
mirrors the M2 hub-detector structure. Threshold: 500 words. Exemption
list for utility domains (example.com, httpbin.org, etc) where thinness
is by design.

The originally proposed --retry-thin flag was dropped after phase A
determined webclaw has no headless-JS backend to retry to (--browser
only affects User-Agent impersonation, not actual rendering). The
hint-only design lets the caller decide: re-run with --browser chrome
manually, or switch to a different fetcher entirely.

Hint suppressed in --mode summary / --mode toc (link/outline focused);
M3 fast-fails skip the formatter entirely so no hint.

Stdout invariance: tested byte-identical on all p01-p15 default probes.
M10 only modifies stderr.

10 new tests (workspace 678 -> 688).
2026-05-23 22:18:12 +02:00
devnen
dfcd51d9e0 feat(core): HTTP status header line in -f llm/text/json output
Webclaw previously emitted URL, Title, Description, and Word count in
the -f llm header but no HTTP status. On a 404 response, the caller
had no signal apart from inspecting the body (e.g. dailysabah.com/
business/economy returns a 404 page; webclaw was extracting '13 words'
of the error page without flagging the 404 status).

New behavior: every -f llm/text/json output includes a 'Status: <code>'
header line (after URL: per phase A's placement). Emitted on all
responses including 200 for consistency — callers can't otherwise
distinguish 'webclaw saw 200' from 'webclaw missed status info'.

For -f json: top-level "status": <code> field added.

Modes --mode summary and --mode toc are exempt: the status line would
clutter the link-list and outline outputs.
M3 fast-fails (known-bad-sites) also skip the status line because
they exit before the formatter is reached.

7 new tests in webclaw-core (workspace total 671 -> 678).
2026-05-23 21:29:26 +02:00
devnen
66974366d7 feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld
JSON-LD is consistently the cleanest source on major outlets (Reuters,
BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured
Data block at the bottom of -f llm output; this iter teaches it to
parse the JSON-LD by schema and surface it usefully.

New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies
items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review,
WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is
auto-lifted (Reuters CollectionPage shape).

Two new CLI flags:

--prefer-structured: surfaces the schema-aware block at the TOP of the
output, before prose. For -f llm emits a Markdown summary block; for
-f json emits a {structured, extracted} envelope. Bypasses the default
DROP list for WebPage/chrome types when explicitly requested.

--articles-from-jsonld: when the page contains ItemList or
LiveBlogPosting, output ONLY a JSON array of articles
({position, title, url, published}). When no such schema is present,
emit a stderr hint and fall through to default extraction (no error).

Default behavior (neither flag set) byte-identical to iter-3 on all
default-flag probes (regression sentinel passed): Cyrillic p14 still
7735 B, M1 caps p18/p19/p20 deterministic, M2 hub p40/p41 byte-identical,
M3 registry p44/p45/p46 still fast-fail with exit 67.

14 new tests in webclaw-core covering schema-variant parsing, parse
error handling, fall-through behavior, flag combinations, and the
default-byte-identical sentinel. Workspace tests 657 -> 671.
2026-05-23 20:38:59 +02:00
devnen
e28b22adf7 feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls
Sites known to require CAPTCHA-solving (Cloudflare interstitials) or
browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot
be reached by webclaw's chrome impersonation; they return interstitial
stubs ('Just a moment...', 'Please enable JS and disable any ad blocker')
with 0 useful content. Currently each call wastes 5-10s on the timeout
before the caller sees the failure.

New registry under crates/webclaw-fetch/src/known_bad_sites.rs lists
known bad hosts with a category (CloudflareInterstitial / AdblockWall)
and suggested substitute domains. Host matching: lowercase + strip
leading 'www.' + exact-match against registered host.

On registry hit, webclaw writes 'error: <host> is <category>-walled;
suggested substitute: <alt1>, <alt2>' to stderr and exits with code 67
(EX_NOHOST), BEFORE making any network call. wall_ms drops from ~5000
to <50 for listed hosts.

Initial entries: ambito.com (Cloudflare; substitutes cronista.com,
iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr,
lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are
subscription paywalls with different bypass semantics; deferred to M11.

10 new tests in webclaw-fetch covering host normalization, www
stripping, path-under-host matching, case insensitivity, unknown-domain
pass-through, and the formatted error message (9 unit + 1 fetch-layer
integration). Workspace test total 647 -> 657.
2026-05-23 19:42:15 +02:00
devnen
31a8f6150f feat(core): JS-hub page detector + --prefer-articles flag
Detects ESPN-style hub pages (espn.com/nba/, /nfl/, /mlb/, /nhl/, /soccer/)
where the rendered markup has nav-only content with no article bodies —
chrome retry doesn't help because the data genuinely isn't in the markup.
Heuristic: word_count < 500 AND link_count >= 5 against the extracted output.

--prefer-articles: when set, a hub-classified page returns the extracted
link list (reusing the M1 --mode summary machinery) instead of the sparse
body. On non-hub pages, behavior is unchanged.

stderr hint: always emitted on hub detection so the caller knows to drill
/story/_/id/<id>/ URLs from a citation list.

False-positive resistance verified: BBC News /world (link-heavy aggregator,
1500+ words body) and n1info.rs (widget-heavy but content-rich) both
classify as non-hub and emit full extraction.

9 new tests in webclaw-core (317 -> 326).
2026-05-23 18:55:17 +02:00
devnen
339f41bb7c feat(cli): add --max-output-bytes and --mode summary,toc for output-size control
Three additive CLI flags addressing the 50KB persisted-output cap that
trips Claude Code's per-tool-result harness on aggregator front pages
(apnews.com, cnbc.com/markets/, b92.net all >50KB by default):

--max-output-bytes N: truncates final output at N bytes with a clear
'[truncated: M more bytes ...]' footer. N=0 means unlimited (default).
UTF-8 codepoint-boundary safe; also wraps JSON output so truncated
output stays parseable.

--mode summary: returns only the extracted link list (titles + URLs),
no body text. For aggregator front pages where the LLM is going to
drill the individual articles next anyway.

--mode toc: returns H1/H2 outline + first paragraph after each H2.
For long single-article pages.

New flags are orthogonal to -f (json/llm/text). 9 new unit tests in
webclaw-core, total goes 308 -> 317 passing. Smoke-tested on
apnews.com (51713 -> 27404 summary -> 6269 toc -> 8193 capped),
pitchfork.com (42049 -> 379 summary), cnbc.com (56682 -> 16385 capped).
2026-05-23 18:17:42 +02:00
devnen
562c6a15f0 gitignore: cover improve-loop and local build artifacts
improve-loop's loop.py writes baselines/, .loop-scratch/ and a
*-loop-progress.log per run. _build-release.bat / _build-release.log are
a local wrapper for invoking cargo build with the right MSVC + LLVM +
NASM env (replaces the missing update.py from CLAUDE.md). None should
land in git.
2026-05-23 17:42:05 +02:00
devnen
e620173d3a docs+gitignore: portable-install sync note and local scratch ignores
CLAUDE.md gains a mandatory step at the top describing the rebuild->copy->
verify dance for the portable Claude Code install at C:\_projects\claude-
portable, plus a local-build env snippet for the BoringSSL bindgen vars
(LIBCLANG_PATH, NASM on PATH) that update.py sets automatically but a plain
shell does not.

.gitignore adds runtime/scratch entries that shouldn't have been tracked:
__pycache__/, .last_update_check, .playwright-cli/, demo_sample.html,
demo_saved.json. Nothing currently tracked is affected (none of these were
under version control).
2026-05-23 17:29:05 +02:00
Valerio
8fe8bcb479 chore(ci): bump actions/checkout and artifact actions to v5
GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4
as Node.js 20 actions, force-migrated to Node 24 on 2026-06-02. Bump
all nine references to v5 ahead of the deadline. The artifact steps are
v5-compatible: upload uses a unique matrix-target name and the download
step flattens subdirectories with find afterward.
2026-05-21 15:11:29 +02:00
Valerio
51260ae4e3 chore(release): record v0.6.4 version bump and changelog
The v0.6.4 tag shipped the API surface discovery module but the
release commit left the workspace version at 0.6.3 with no matching
changelog entry. Bump [workspace.package] to 0.6.4 and add the
[0.6.4] CHANGELOG section so the code matches the tag.
2026-05-21 12:58:47 +02:00
Valerio
fe567a6af1
feat(core): endpoints module for API surface extraction from HTML and JS (#47)
* feat(core): endpoints module — extract API surface from HTML + JS bundles

* fix(docker): source CA bundle from distroless instead of apt (fixes arm64 release build)

* fix(test): serialize env-mutating CloudClient tests to stop flaky CI

* feat(core): filter endpoint-extractor noise (invalid hosts, schema domains, bare paths)
2026-05-19 19:05:16 +02:00
Valerio
be8bcfebd9
fix: harden resource limits, path safety, and WASM build (#46)
Security audit follow-up across the workspace:

- webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a
  cfg(not(wasm32)) target dependency and the extraction entry point uses
  a direct call on wasm instead of spawning a thread, so it builds and
  runs on wasm32 with or without default features.
- webclaw-core: bound the structured-data scrubber recursion (depth cap)
  so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the
  stack.
- webclaw-fetch: stream the response body with a running ceiling so a
  small highly compressed payload cannot inflate to gigabytes in memory;
  redact user:pass@ from proxy URLs before they reach error strings.
- webclaw-cli: contain output filenames inside the chosen directory
  (reject .. / absolute, drop traversal path segments), run --webhook
  URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s,
  and make research slug truncation char-safe.
- webclaw-mcp: char-safe slug truncation (no multibyte slice panic).
- setup.sh / deploy/hetzner.sh: replace eval on read input with
  printf -v, and mask auth key / API token in console output.
- CI: enforce the wasm32 build invariant for webclaw-core.

Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.
2026-05-19 17:03:52 +02:00
Valerio
aab51bea91 docs: add workflow examples 2026-05-18 18:56:00 +02:00
Valerio
b75b768ec3 Update Quantum Proxies sponsor copy 2026-05-18 18:50:38 +02:00
Valerio
3fabdc1d02
fix: clean llm output noise
Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions.\n\nIncludes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default.\n\nThanks to @devnen for the report and patch work.
2026-05-18 18:39:33 +02:00
Valerio
5eef8358b0 docs: update sponsor partner details 2026-05-18 13:09:02 +02:00
Valerio
7dfd62ec1d docs: add proxy-seller studio partner 2026-05-18 12:37:28 +02:00
Valerio
6d886c44f6 docs: enlarge studio partner banner 2026-05-18 12:27:11 +02:00
Valerio
8e3ad17428 docs: tighten studio partner layout 2026-05-18 12:23:19 +02:00
Valerio
7321549412 docs: add studio partner section 2026-05-18 12:17:34 +02:00
Valerio
72edb61881
Merge pull request #42 from jal-co/docs/add-community-plugins
docs: add community plugins section
2026-05-16 11:24:33 +02:00
Valerio
00d86a12bc docs: refine community plugin copy 2026-05-16 11:19:15 +02:00
Justin Levine
c8be5214f6
docs: add community plugins section with OpenClaw and Hermes integrations 2026-05-15 17:51:22 -07:00
Valerio
0ea189c5b2 fix(ci): pass repository to release cli
Some checks failed
CI / Test (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Docs (push) Has been cancelled
2026-05-12 12:28:14 +02:00
Valerio
a629534490
fix(security): prepare 0.6.1 hardening
Merge the 0.6.1 security hardening release candidate after local and CI verification.
2026-05-12 12:16:42 +02:00
Valerio
fd2e75d509 chore(fetch): satisfy clippy for resolver setup 2026-05-12 12:09:18 +02:00
Valerio
e2f89941ac chore(release): prepare 0.6.1 2026-05-12 12:06:06 +02:00
Valerio
307b4f980d fix(extractors): harden marketplace host matching 2026-05-12 12:03:43 +02:00
Valerio
dbf9ce08a6 fix(ci): scope release workflow token permissions 2026-05-12 12:00:47 +02:00
Valerio
3bcb288d13 fix(fetch): guard challenge detection before utf8 decoding 2026-05-12 12:00:47 +02:00
Valerio
a611ae26f3 fix(security): harden local fetch surfaces 2026-05-12 12:00:25 +02:00
Valerio
af96628dc9
Revise README for clarity and updated content
Some checks failed
CI / Test (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Docs (push) Has been cancelled
Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.
2026-05-10 22:44:57 +02:00
devnen
e8ca1417d6
Improve --format llm output quality (#37)
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
Improve LLM-format output for modern news and documentation pages.

- Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records
- Fix element/text spacing without detaching punctuation on docs, forums, and reference pages
- Remove common accessibility link chrome from LLM text and link labels
- Bump workspace version to 0.6.0 and update the changelog

Thanks to Nenad Oric (@devnen) for the original PR and contribution.
2026-05-10 15:11:12 +02:00
Valerio
7f75143954 docs: update hosted api trial copy
Some checks failed
CI / Test (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Docs (push) Has been cancelled
2026-05-06 17:16:35 +02:00
Valerio
e6a95f783d chore: bump version to 0.5.9
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
2026-05-06 11:42:09 +02:00
Valerio
a3aa4bce6f fix: support LLM provider compatibility options
Closes #36
2026-05-06 11:36:53 +02:00
Valerio
86183b11e4 docs: credit Windows release contribution
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
2026-05-05 11:44:07 +02:00
SURYANSH MISHRA
513b0e493e ci: add Windows release artifacts
Closes #34
2026-05-05 11:38:30 +02:00
Valerio
a1242a1c1d docs: credit README badge refresh 2026-05-05 11:18:58 +02:00
Justin Levine
a542e45768
docs: refresh README badges
Replace README badges with shieldcn-styled badges.
2026-05-05 11:17:21 +02:00
Valerio
615f326660 docs: update changelog for brand extraction
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
2026-05-04 21:52:49 +02:00
Valerio
72b8dbc285 fix: improve brand extraction signals 2026-05-04 21:25:07 +02:00
Valerio
1c9def2fde fix: validate self-host route URLs consistently 2026-05-04 14:30:06 +02:00
Valerio
eede2f6953 docs: credit SSRF report
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
2026-05-04 12:08:11 +02:00
Valerio
bdf81fe6bf fix: harden fetch URL validation 2026-05-04 11:50:57 +02:00
Valerio
23544f8fac docs(claude): note youtube.rs role and yt-dlp short-circuit in server
Some checks are pending
CI / Test (push) Waiting to run
CI / Lint (push) Waiting to run
CI / Docs (push) Waiting to run
The webclaw-core youtube module produces structured markdown but no
transcript; document that and point at the production server's
youtube_transcript.rs short-circuit for the full YoutubeData + caption
text shape.
2026-05-03 21:17:23 +02:00
Valerio
923445f4a8 docs(readme): add h1 brand heading
Some checks failed
CI / Test (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Docs (push) Has been cancelled
The repo had no heading-level brand anchor, only a banner image and
an h3 slogan. Search engines indexing the README were missing the
canonical brand signal. The new h1 is what GitHub renders as the
title of the page and what Google co-ranks with webclaw.io.

Bumps workspace version to 0.5.7.
2026-04-30 11:47:02 +02:00
Valerio
0e6c7cdc97
Add GitHub Sponsors username to FUNDING.yml
Some checks failed
CI / Test (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Docs (push) Has been cancelled
Updated funding model with GitHub Sponsors username.
2026-04-27 13:18:22 +02:00