webclaw

mirror of https://github.com/0xMassi/webclaw.git synced 2026-06-09 22:35:12 +02:00

Author	SHA1	Message	Date
devnen	d5a3aa4bf9	feat(core): word-count breakdown in header — article vs chrome split Current Word count: N is a single number conflating article body and surrounding chrome (nav, ads, footer). Callers couldn't tell from the header alone whether to drill or move on. New: Word count: N (article: M, chrome: K) in -f llm/text output. For -f json: adds word_count_article and word_count_chrome fields alongside the existing word_count. M (article body) is sourced from JSON-LD articleBody when M4's parser found one (NewsArticle or Review.reviewBody); otherwise computed by llm::body_word_count (the M2-style heuristic — words outside markdown link patterns, the same body::process_body output hub_detect uses). --mode summary / toc / sections fall back to the simple Word count: N form (the modes don't extract body content; the breakdown would be meaningless). Suppression piggybacks on the existing include_status toggle in build_metadata_header_with_opts. 9 new tests in webclaw-core (4 in lib.rs::tests for the population logic; 5 in llm/metadata.rs::m12_tests for the header formatter). Workspace 701 -> 710.	2026-05-23 23:56:14 +02:00
devnen	ade2a5143c	feat(core): --mode sections for nav-URL discovery Section-URL ambiguity is recurring friction: callers have to guess whether to hit infobae.com root (LATAM frontpage) or /economia/ (AR-specific live FX dashboard), or decrypt.co root (ticker ribbon) vs /news/ (article list), or bbc.com/news/world vs /news/world/europe/. Each guess costs a round-trip. New `--mode sections` returns the discoverable section URLs parsed from the page's nav, in one round-trip. Subsumes issue #16 (non-English nav harder to LLM-parse; sections come back as data, not prose). Multi-signal heuristic on the existing link extraction: URL-pattern match (/<category>/ style short paths), repetition (section links appear in header + footer), DOM-position when available. Fallback when zero sections detected: emit top-N links with a "(none detected; first N shown)" note. Format: -f llm/text emits `Sections:` followed by `- [Label](url)` list. -f json emits `{"sections": [{"label": "...", "url": "..."}]}`. 13 new tests in webclaw-core (688 -> 701).	2026-05-23 23:14:40 +02:00
devnen	76cd515a3e	feat(core): thin-body classifier + stderr hint for JS-walled content-heavy sites On sites like Hollywood Reporter where the extracted body is < 500 words because the page is JS-walled (chrome rendering is needed), webclaw now emits a one-line stderr hint: # hint: extracted body is N words (thin); the page may be JS-walled. Try --browser chrome for JS-rendered content. Thin-body classification (crates/webclaw-core/src/llm/thin_body.rs) mirrors the M2 hub-detector structure. Threshold: 500 words. Exemption list for utility domains (example.com, httpbin.org, etc) where thinness is by design. The originally proposed --retry-thin flag was dropped after phase A determined webclaw has no headless-JS backend to retry to (--browser only affects User-Agent impersonation, not actual rendering). The hint-only design lets the caller decide: re-run with --browser chrome manually, or switch to a different fetcher entirely. Hint suppressed in --mode summary / --mode toc (link/outline focused); M3 fast-fails skip the formatter entirely so no hint. Stdout invariance: tested byte-identical on all p01-p15 default probes. M10 only modifies stderr. 10 new tests (workspace 678 -> 688).	2026-05-23 22:18:12 +02:00
devnen	dfcd51d9e0	feat(core): HTTP status header line in -f llm/text/json output Webclaw previously emitted URL, Title, Description, and Word count in the -f llm header but no HTTP status. On a 404 response, the caller had no signal apart from inspecting the body (e.g. dailysabah.com/business/economy returns a 404 page; webclaw was extracting '13 words' of the error page without flagging the 404 status). New behavior: every -f llm/text/json output includes a 'Status: <code>' header line (after URL: per phase A's placement). Emitted on all responses including 200 for consistency: callers can't otherwise distinguish 'webclaw saw 200' from 'webclaw missed status info'. For -f json: top-level "status": <code> field added. Modes --mode summary and --mode toc are exempt: the status line would clutter the link-list and outline outputs. M3 fast-fails (known-bad-sites) also skip the status line because they exit before the formatter is reached. 7 new tests in webclaw-core (workspace total 671 -> 678).	2026-05-23 21:29:26 +02:00
devnen	66974366d7	feat(core): schema-aware JSON-LD parser + --prefer-structured + --articles-from-jsonld JSON-LD is consistently the cleanest source on major outlets (Reuters, BBC, Le Monde, N1, Pitchfork). Webclaw already emitted a raw Structured Data block at the bottom of -f llm output; this iter teaches it to parse the JSON-LD by schema and surface it usefully. New schema-aware parser at crates/webclaw-core/src/jsonld.rs classifies items by @type into: ItemList, LiveBlogPosting, NewsArticle, Review, WebPageOrChrome, Unknown. CollectionPage with mainEntity ItemList is auto-lifted (Reuters CollectionPage shape). Two new CLI flags: --prefer-structured: surfaces the schema-aware block at the TOP of the output, before prose. For -f llm emits a Markdown summary block; for -f json emits a {structured, extracted} envelope. Bypasses the default DROP list for WebPage/chrome types when explicitly requested. --articles-from-jsonld: when the page contains ItemList or LiveBlogPosting, output ONLY a JSON array of articles ({position, title, url, published}). When no such schema is present, emit a stderr hint and fall through to default extraction (no error). Default behavior (neither flag set) byte-identical to iter-3 on all default-flag probes (regression sentinel passed): Cyrillic p14 still 7735 B, M1 caps p18/p19/p20 deterministic, M2 hub p40/p41 byte-identical, M3 registry p44/p45/p46 still fast-fail with exit 67. 14 new tests in webclaw-core covering schema-variant parsing, parse error handling, fall-through behavior, flag combinations, and the default-byte-identical sentinel. Workspace tests 657 -> 671.	2026-05-23 20:38:59 +02:00
devnen	e28b22adf7	feat(fetch): known-bad-sites registry for fast-fail on Cloudflare / adblock walls Sites known to require CAPTCHA-solving (Cloudflare interstitials) or browser-side ad-blocker bypass (JS+adblock walls like Liberation) cannot be reached by webclaw's chrome impersonation; they return interstitial stubs ('Just a moment...', 'Please enable JS and disable any ad blocker') with 0 useful content. Currently each call wastes 5-10s on the timeout before the caller sees the failure. New registry under crates/webclaw-fetch/src/known_bad_sites.rs lists known bad hosts with a category (CloudflareInterstitial / AdblockWall) and suggested substitute domains. Host matching: lowercase + strip leading 'www.' + exact-match against registered host. On registry hit, webclaw writes 'error: <host> is <category>-walled; suggested substitute: <alt1>, <alt2>' to stderr and exits with code 67 (EX_NOHOST), BEFORE making any network call. wall_ms drops from ~5000 to <50 for listed hosts. Initial entries: ambito.com (Cloudflare; substitutes cronista.com, iprofesional.com), liberation.fr (adblock; substitutes lemonde.fr, lepoint.fr). WSJ/FT/Bloomberg/NYT are NOT included -- those are subscription paywalls with different bypass semantics; deferred to M11. 10 new tests in webclaw-fetch covering host normalization, www stripping, path-under-host matching, case insensitivity, unknown-domain pass-through, and the formatted error message (9 unit + 1 fetch-layer integration). Workspace test total 647 -> 657.	2026-05-23 19:42:15 +02:00
devnen	31a8f6150f	feat(core): JS-hub page detector + --prefer-articles flag Detects ESPN-style hub pages (espn.com/nba/, /nfl/, /mlb/, /nhl/, /soccer/) where the rendered markup has nav-only content with no article bodies — chrome retry doesn't help because the data genuinely isn't in the markup. Heuristic: word_count < 500 AND link_count >= 5 against the extracted output. --prefer-articles: when set, a hub-classified page returns the extracted link list (reusing the M1 --mode summary machinery) instead of the sparse body. On non-hub pages, behavior is unchanged. stderr hint: always emitted on hub detection so the caller knows to drill /story/_/id/<id>/ URLs from a citation list. False-positive resistance verified: BBC News /world (link-heavy aggregator, 1500+ words body) and n1info.rs (widget-heavy but content-rich) both classify as non-hub and emit full extraction. 9 new tests in webclaw-core (317 -> 326).	2026-05-23 18:55:17 +02:00
devnen	339f41bb7c	feat(cli): add --max-output-bytes and --mode summary,toc for output-size control Three additive CLI flags addressing the 50KB persisted-output cap that trips Claude Code's per-tool-result harness on aggregator front pages (apnews.com, cnbc.com/markets/, b92.net all >50KB by default): --max-output-bytes N: truncates final output at N bytes with a clear '[truncated: M more bytes ...]' footer. N=0 means unlimited (default). UTF-8 codepoint-boundary safe; also wraps JSON output so truncated output stays parseable. --mode summary: returns only the extracted link list (titles + URLs), no body text. For aggregator front pages where the LLM is going to drill the individual articles next anyway. --mode toc: returns H1/H2 outline + first paragraph after each H2. For long single-article pages. New flags are orthogonal to -f (json/llm/text). 9 new unit tests in webclaw-core, total goes 308 -> 317 passing. Smoke-tested on apnews.com (51713 -> 27404 summary -> 6269 toc -> 8193 capped), pitchfork.com (42049 -> 379 summary), cnbc.com (56682 -> 16385 capped).	2026-05-23 18:17:42 +02:00
devnen	562c6a15f0	gitignore: cover improve-loop and local build artifacts improve-loop's loop.py writes baselines/, .loop-scratch/ and a *-loop-progress.log per run. _build-release.bat / _build-release.log are a local wrapper for invoking cargo build with the right MSVC + LLVM + NASM env (replaces the missing update.py from CLAUDE.md). None should land in git.	2026-05-23 17:42:05 +02:00
devnen	e620173d3a	docs+gitignore: portable-install sync note and local scratch ignores CLAUDE.md gains a mandatory step at the top describing the rebuild->copy->verify dance for the portable Claude Code install at C:\_projects\claude-portable, plus a local-build env snippet for the BoringSSL bindgen vars (LIBCLANG_PATH, NASM on PATH) that update.py sets automatically but a plain shell does not. .gitignore adds runtime/scratch entries that shouldn't have been tracked: __pycache__/, .last_update_check, .playwright-cli/, demo_sample.html, demo_saved.json. Nothing currently tracked is affected (none of these were under version control).	2026-05-23 17:29:05 +02:00
Valerio	8fe8bcb479	chore(ci): bump actions/checkout and artifact actions to v5 GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4 as Node.js 20 actions, force-migrated to Node 24 on 2026-06-02. Bump all nine references to v5 ahead of the deadline. The artifact steps are v5-compatible: upload uses a unique matrix-target name and the download step flattens subdirectories with find afterward.	2026-05-21 15:11:29 +02:00
Valerio	51260ae4e3	chore(release): record v0.6.4 version bump and changelog The v0.6.4 tag shipped the API surface discovery module but the release commit left the workspace version at 0.6.3 with no matching changelog entry. Bump [workspace.package] to 0.6.4 and add the [0.6.4] CHANGELOG section so the code matches the tag.	2026-05-21 12:58:47 +02:00
Valerio	fe567a6af1	feat(core): endpoints module for API surface extraction from HTML and JS (#47) * feat(core): endpoints module: extract API surface from HTML + JS bundles * fix(docker): source CA bundle from distroless instead of apt (fixes arm64 release build) * fix(test): serialize env-mutating CloudClient tests to stop flaky CI * feat(core): filter endpoint-extractor noise (invalid hosts, schema domains, bare paths)	2026-05-19 19:05:16 +02:00
Valerio	be8bcfebd9	fix: harden resource limits, path safety, and WASM build (#46) Security audit follow-up across the workspace: - webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a cfg(not(wasm32)) target dependency and the extraction entry point uses a direct call on wasm instead of spawning a thread, so it builds and runs on wasm32 with or without default features. - webclaw-core: bound the structured-data scrubber recursion (depth cap) so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the stack. - webclaw-fetch: stream the response body with a running ceiling so a small highly compressed payload cannot inflate to gigabytes in memory; redact user:pass@ from proxy URLs before they reach error strings. - webclaw-cli: contain output filenames inside the chosen directory (reject .. / absolute, drop traversal path segments), run --webhook URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s, and make research slug truncation char-safe. - webclaw-mcp: char-safe slug truncation (no multibyte slice panic). - setup.sh / deploy/hetzner.sh: replace eval on read input with printf -v, and mask auth key / API token in console output. - CI: enforce the wasm32 build invariant for webclaw-core. Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.	2026-05-19 17:03:52 +02:00
Valerio	aab51bea91	docs: add workflow examples	2026-05-18 18:56:00 +02:00
Valerio	b75b768ec3	Update Quantum Proxies sponsor copy	2026-05-18 18:50:38 +02:00
Valerio	3fabdc1d02	fix: clean llm output noise Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions. Includes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default. Thanks to @devnen for the report and patch work.	2026-05-18 18:39:33 +02:00
Valerio	5eef8358b0	docs: update sponsor partner details	2026-05-18 13:09:02 +02:00
Valerio	7dfd62ec1d	docs: add proxy-seller studio partner	2026-05-18 12:37:28 +02:00
Valerio	6d886c44f6	docs: enlarge studio partner banner	2026-05-18 12:27:11 +02:00
Valerio	8e3ad17428	docs: tighten studio partner layout	2026-05-18 12:23:19 +02:00
Valerio	7321549412	docs: add studio partner section	2026-05-18 12:17:34 +02:00
Valerio	72edb61881	Merge pull request #42 from jal-co/docs/add-community-plugins docs: add community plugins section	2026-05-16 11:24:33 +02:00
Valerio	00d86a12bc	docs: refine community plugin copy	2026-05-16 11:19:15 +02:00
Justin Levine	c8be5214f6	docs: add community plugins section with OpenClaw and Hermes integrations	2026-05-15 17:51:22 -07:00
Valerio	0ea189c5b2	fix(ci): pass repository to release cli	2026-05-12 12:28:14 +02:00
Valerio	a629534490	fix(security): prepare 0.6.1 hardening Merge the 0.6.1 security hardening release candidate after local and CI verification.	2026-05-12 12:16:42 +02:00
Valerio	fd2e75d509	chore(fetch): satisfy clippy for resolver setup	2026-05-12 12:09:18 +02:00
Valerio	e2f89941ac	chore(release): prepare 0.6.1	2026-05-12 12:06:06 +02:00
Valerio	307b4f980d	fix(extractors): harden marketplace host matching	2026-05-12 12:03:43 +02:00
Valerio	dbf9ce08a6	fix(ci): scope release workflow token permissions	2026-05-12 12:00:47 +02:00
Valerio	3bcb288d13	fix(fetch): guard challenge detection before utf8 decoding	2026-05-12 12:00:47 +02:00
Valerio	a611ae26f3	fix(security): harden local fetch surfaces	2026-05-12 12:00:25 +02:00
Valerio	af96628dc9	Revise README for clarity and updated content Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.	2026-05-10 22:44:57 +02:00
devnen	e8ca1417d6	Improve --format llm output quality (#37) Improve LLM-format output for modern news and documentation pages. - Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records - Fix element/text spacing without detaching punctuation on docs, forums, and reference pages - Remove common accessibility link chrome from LLM text and link labels - Bump workspace version to 0.6.0 and update the changelog Thanks to Nenad Oric (@devnen) for the original PR and contribution.	2026-05-10 15:11:12 +02:00
Valerio	7f75143954	docs: update hosted api trial copy	2026-05-06 17:16:35 +02:00
Valerio	e6a95f783d	chore: bump version to 0.5.9	2026-05-06 11:42:09 +02:00
Valerio	a3aa4bce6f	fix: support LLM provider compatibility options Closes #36	2026-05-06 11:36:53 +02:00
Valerio	86183b11e4	docs: credit Windows release contribution	2026-05-05 11:44:07 +02:00
SURYANSH MISHRA	513b0e493e	ci: add Windows release artifacts Closes #34	2026-05-05 11:38:30 +02:00
Valerio	a1242a1c1d	docs: credit README badge refresh	2026-05-05 11:18:58 +02:00
Justin Levine	a542e45768	docs: refresh README badges Replace README badges with shieldcn-styled badges.	2026-05-05 11:17:21 +02:00
Valerio	615f326660	docs: update changelog for brand extraction	2026-05-04 21:52:49 +02:00
Valerio	72b8dbc285	fix: improve brand extraction signals	2026-05-04 21:25:07 +02:00
Valerio	1c9def2fde	fix: validate self-host route URLs consistently	2026-05-04 14:30:06 +02:00
Valerio	eede2f6953	docs: credit SSRF report	2026-05-04 12:08:11 +02:00
Valerio	bdf81fe6bf	fix: harden fetch URL validation	2026-05-04 11:50:57 +02:00
Valerio	23544f8fac	docs(claude): note youtube.rs role and yt-dlp short-circuit in server The webclaw-core youtube module produces structured markdown but no transcript; document that and point at the production server's youtube_transcript.rs short-circuit for the full YoutubeData + caption text shape.	2026-05-03 21:17:23 +02:00
Valerio	923445f4a8	docs(readme): add h1 brand heading The repo had no heading-level brand anchor, only a banner image and an h3 slogan. Search engines indexing the README were missing the canonical brand signal. The new h1 is what GitHub renders as the title of the page and what Google co-ranks with webclaw.io. Bumps workspace version to 0.5.7.	2026-04-30 11:47:02 +02:00
Valerio	0e6c7cdc97	Add GitHub Sponsors username to FUNDING.yml Updated funding model with GitHub Sponsors username.	2026-04-27 13:18:22 +02:00
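
Commit 339f41bb7c describes --max-output-bytes as "UTF-8 codepoint-boundary safe", with N=0 meaning unlimited. The sketch below shows one way such a truncation could work. It is not taken from the webclaw source: the function name `truncate_utf8` is hypothetical, and the truncation footer and the JSON wrapping mentioned in the commit are left out.

```rust
/// Hypothetical helper: cut `s` to at most `max_bytes` bytes without
/// splitting a multi-byte UTF-8 character. `max_bytes == 0` means
/// "unlimited", matching the N=0 convention in the commit message.
pub fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if max_bytes == 0 || s.len() <= max_bytes {
        return s;
    }
    // Walk back from the byte limit until we land on a character boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Backing off to the previous boundary means the result can be a few bytes shorter than the limit, but it is always valid UTF-8 and never exceeds the cap.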

169 commits
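
Commit e28b22adf7 describes the known-bad-sites host matching as "lowercase + strip leading 'www.' + exact-match against registered host". A minimal sketch of that lookup follows, assuming a static table. The type and function names (`KnownBadSite`, `normalize_host`, `lookup`) are illustrative and not confirmed against crates/webclaw-fetch/src/known_bad_sites.rs; the two entries and their substitutes are the ones listed in the commit message.

```rust
/// Wall categories named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WallCategory {
    CloudflareInterstitial,
    AdblockWall,
}

/// One registry entry (hypothetical layout).
#[derive(Debug)]
pub struct KnownBadSite {
    pub host: &'static str,
    pub category: WallCategory,
    pub substitutes: &'static [&'static str],
}

/// Initial entries as listed in the commit message.
pub const REGISTRY: &[KnownBadSite] = &[
    KnownBadSite {
        host: "ambito.com",
        category: WallCategory::CloudflareInterstitial,
        substitutes: &["cronista.com", "iprofesional.com"],
    },
    KnownBadSite {
        host: "liberation.fr",
        category: WallCategory::AdblockWall,
        substitutes: &["lemonde.fr", "lepoint.fr"],
    },
];

/// Lowercase the host and strip a single leading "www.".
pub fn normalize_host(host: &str) -> String {
    let lower = host.to_ascii_lowercase();
    lower.strip_prefix("www.").unwrap_or(lower.as_str()).to_string()
}

/// Exact match of the normalized host against the registry.
pub fn lookup(host: &str) -> Option<&'static KnownBadSite> {
    let normalized = normalize_host(host);
    REGISTRY.iter().find(|site| site.host == normalized)
}
```

Because the lookup is a pure in-memory comparison, a caller can run it before any network call, which is what lets listed hosts fail fast. In this sketch, exact matching means other subdomains of a listed host do not match.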