When bash splits a URL at & or ? (a common foot-gun), webclaw
receives only the truncated prefix and silently fetches the wrong
page. Per issue #6:
1. Heuristic warning: if the URL ends with '&' or contains '?' with
no '=' after, emit a stderr warning before fetching:
# webclaw: warning: URL looks truncated (ends with '&' or '?'); did the shell split it? Quote the URL or use --url-encoded.
2. New flag --url-encoded: parallel input that asserts the user has
handled escaping. Suppresses the truncation warning since intent
is explicit.
Fetch proceeds in both cases; this is informational only. 4 new
tests in webclaw-cli. Workspace 720 -> 724.
(cherry picked from commit 4ef27fcd33)
Webclaw's default -t timeout is 30s; slow sites previously sat
silently with no feedback. Now during a fetch, every 10s of elapsed
time webclaw writes one line to stderr:
# webclaw: still fetching <URL> (Ns)
Fetches completing in under 10s emit nothing (the timer never fires).
Stdout output is untouched - pure feedback signal on stderr.
No timeout change. No new flags. Default behavior is augmented at
stderr only.
Implemented via tokio::select! between the fetch future and a
tokio::time::interval. Latency cost: a single tokio task spawn
and a 10s tick - microseconds on the fast path.
10 new tests in webclaw-fetch::progress::tests (none ignored; the
slow-future test uses a 50ms test interval to keep cargo test fast).
Workspace total 710 -> 720.
(cherry picked from commit 06f065cb08)
wreq is a release candidate with no API stability between rc.N builds
(rc.29 broke the TLS + Response API). `cargo install` and the release
workflow both ignore Cargo.lock and were re-resolving to rc.29, breaking
the build. An exact `=6.0.0-rc.28` / `=3.0.0-rc.10` pin keeps every build
path deterministic until wreq reaches a stable release.
The TLS layer moved to wreq (BoringSSL) in-process; there is no longer a
[patch.crates-io] section or a separate TLS fork. Update the architecture
tree and crate-boundary notes to match.
Reddit blocked unauthenticated `.json` access, so the previous extractor
returned block pages or timed out on every thread. Switch to parsing
old.reddit.com's server-rendered HTML, which needs no API key or JS.
Fetch layer:
- Rewrite every Reddit host to old.reddit.com before fetching; drop all
`.json` URL handling and the JSON response parser.
Extraction (webclaw-core::reddit):
- New HTML parser producing a typed post + nested comment tree.
- Comments nest structurally (.comment > .child > .sitetable > .comment);
old.reddit omits a usable depth attribute, so the tree is walked
recursively. Bodies live in .entry > form > .usertext-body > .md.
- Post metadata: title, author, subreddit, score, comment count
(data-comments-count), self-vs-link (self class / self.* domain),
flair, self-text body.
- Comment scores read the .score.unvoted title (the displayed value, not
the ±1 vote-state siblings); hidden scores are None, not 0.
- Deleted comments are kept in place so their replies aren't orphaned;
"load more comments" stubs are skipped.
Markdown output:
- Reply nesting via blockquote depth (avoids 4-space indentation turning
text and code fences into broken indented-code blocks).
- Links keep their target as [text](url); root-relative reddit links
resolve against old.reddit.com. Nested lists indent correctly.
- A recognised but unparseable /comments/ page returns no content rather
than falling through to generic extraction of Reddit chrome.
Tests: regression suite runs against real old.reddit.com fixtures
(testdata/reddit/), the ground truth that surfaced the parsing and
markdown bugs synthetic HTML had hidden. Fixtures are excluded from the
published crate.
GitHub flagged checkout@v4 / upload-artifact@v4 / download-artifact@v4
as Node.js 20 actions, force-migrated to Node 24 on 2026-06-02. Bump
all nine references to v5 ahead of the deadline. The artifact steps are
v5-compatible: upload uses a unique matrix-target name and the download
step flattens subdirectories with find afterward.
The v0.6.4 tag shipped the API surface discovery module but the
release commit left the workspace version at 0.6.3 with no matching
changelog entry. Bump [workspace.package] to 0.6.4 and add the
[0.6.4] CHANGELOG section so the code matches the tag.
Security audit follow-up across the workspace:
- webclaw-core: keep the crate WASM-safe. quickjs/rquickjs is now a
cfg(not(wasm32)) target dependency and the extraction entry point uses
a direct call on wasm instead of spawning a thread, so it builds and
runs on wasm32 with or without default features.
- webclaw-core: bound the structured-data scrubber recursion (depth cap)
so deeply nested attacker JSON-LD / __NEXT_DATA__ cannot exhaust the
stack.
- webclaw-fetch: stream the response body with a running ceiling so a
small highly compressed payload cannot inflate to gigabytes in memory;
redact user:pass@ from proxy URLs before they reach error strings.
- webclaw-cli: contain output filenames inside the chosen directory
(reject .. / absolute, drop traversal path segments), run --webhook
URLs through the public-URL SSRF guard, clamp --watch-interval to >=1s,
and make research slug truncation char-safe.
- webclaw-mcp: char-safe slug truncation (no multibyte slice panic).
- setup.sh / deploy/hetzner.sh: replace eval on read input with
printf -v, and mask auth key / API token in console output.
- CI: enforce the wasm32 build invariant for webclaw-core.
Tests added for every behavioral change. Bump to 0.6.3 + CHANGELOG.
Port the valid PR #43 LLM cleanup fixes onto current main without stale branch regressions.\n\nIncludes comment-count link cleanup, bare numeric paragraph cleanup, pagination leftover cleanup, JSON-LD article body scrubbing, clearer CLI consent-wall warnings, and quieter parser logs by default.\n\nThanks to @devnen for the report and patch work.
Updated the README to reflect changes in the project description, banner image size, and various content sections. Enhanced clarity on features and usage.
Improve LLM-format output for modern news and documentation pages.
- Filter noisy hydration and low-value page chrome structured data while preserving content-bearing Schema.org records
- Fix element/text spacing without detaching punctuation on docs, forums, and reference pages
- Remove common accessibility link chrome from LLM text and link labels
- Bump workspace version to 0.6.0 and update the changelog
Thanks to Nenad Oric (@devnen) for the original PR and contribution.
The webclaw-core youtube module produces structured markdown but no
transcript; document that and point at the production server's
youtube_transcript.rs short-circuit for the full YoutubeData + caption
text shape.